
GlossySnake: A Justification

We are to read to the grave. To argue about whether one's language limits his world, he has to read on that matter, be it a Wikipedia article or a hardcover. Written language acquisition then becomes his interest.

Motivations for learning foreign languages.

In the 21st century one can argue that an English speaker can acquire most of the information that matters, both ancient and contemporary. While this argument might not be too far from being true, foreign languages are worth learning for:

Language learning methods.

But how do we learn a foreign language? The self-learner can use:

Also read: Language pedagogy, Wikipedia

Grammar-translation method

  • Hand the learner a heavy grammar book and dictionary.
  • Have him translate a sentence, be it about asking for directions or a classical corpus written millennia ago that's hard to parse in any language.
  • The learner quits after attempting to memorize the declension and inflexion table.

This is ineffective when the target language is more complex than, say, English.

Direct method

Or, the Natural method. Babies imitate the language they hear, and the grammar and additional vocabulary are learned naturally. The Direct method claims that an adult learning Latin is not much different from a baby learning English from its parents:

  • Only the target language is used.
  • Grammar is learned indirectly.
  • The learner is to be exposed to comprehensible input.

The best-known example of this method would be Lingua Latina per se Illustrata, or LLPSI, for Latin. Instead of listing grammatical rules and giving translation exercises, Ørberg presents a comprehensible story in Latin only.

The Hamiltonian method?

After learning with the direct method, using materials like LLPSI, the Italian Athenaze, or L'italiano Secondo Il Metodo Natura, the learner is tempted to read actual texts in the target language (it's not that Direct-method corpora are abundant anyway). So he reads other texts like the Vulgata, L'Étranger, etc. But an obstacle arises: the vocabulary. Indexing a dictionary or searching Wiktionary, especially when repeated, becomes but a chore.

/docs/proposal/images/example_aesop.png

  • An example of the Interlinear gloss. Æsop’s Fables, as Romanized By Phædrus: with a Literal Interlinear Translation (1832)

A straightforward solution is the Interlinear gloss: simply, to annotate the gloss between the lines. This is widely used in linguistic papers, often with grammatical labels. But it has also been used for language learning, and it was strongly advocated by James Hamilton (1769-1829), to the point where the Hamiltonian System became synonymous with the Interlinear translation for language learning.

James Hamilton opposed the Grammar-translation method above and insisted on learning the language by the heart -- a phrase that recurs throughout his book defending his system, The History, Principles, Practice, and Results of the Hamiltonian System (1829).

Hamilton argues that his system is more than a synonym for the Interlinear translation:

  1. To teach, instead of ordering to learn;
  2. To translate at once, instead of making the pupils get a grammar by heart. To quote him on these two principles (p. 7):

I taught, instead of ordering to learn; and, secondly, I taught my pupils to translate at once, instead of making them get a grammar by heart. I had tried to parse also, as well as translate, as D'Angeli had done with me, but I found this would do only with linguists: the grammar was incomprehensible at this period to the greater number of my pupils; I therefore deferred it till they had taken half the course: by that time, as they had met in their reading all the inflexions of the verbs, and changes of the other declinable parts of speech, thousands of times, they found grammar an easy task. I then gave them two or three lectures on grammar generally, but particularly the verbs, of which I gave them a copy, and from this period my pupils read at their own home, and in class learned the use of the words they had acquired in reading. They read the English Gospel of St. John into French, first after me, in precisely the same manner as I had taught them first to translate French into English, but with this essential difference, my translation into French was a free translation -- in simple but correct language, which they afterwards wrote; and in the correcting of which I gave them the details of the principles or rules of grammar, and thus taught them to write and speak correctly.

  3. Analytical translation, as opposed to literal translation: the words are analyzed within their context.
  4. "The words of all languages have, with few exceptions, one meaning only, and should be translated generally by the same word, which should stand for its representative at all times, and all places."
  5. "The simple sounds of all languages being, with few exceptions, identically the same, it must be as easy for an Englishman to pronounce French as English, when taught, and vice versâ."

Some points are indeed arguable, but the focus on the analytic keys was insightful. As implied by my word "defending", the system was attacked by others, especially by the advocates of the Grammar-translation method.

Also, to quote the Edinburgh Review (p. 29), the Hamiltonian System:

  1. teaches an unknown tongue by the closest interlinear translation, instead of leaving a boy to explore his way by the lexicon or dictionary.
  2. It postpones the study of grammar till a considerable progress has been acquired.
  3. It substitutes the cheerfulness and competition of the Lancasterian system for the dull solitude of the dictionary.

Thence the Edinburgh Review concludes (p. 30):

The old system aims at beginning with a depth and accuracy which many men never will want, which disgusts many from arriving even at moderate attainments, and is a less easy and not more certain road to a profound skill in languages, than if attention to grammar had been deferred to a later period. In fine, we are strongly persuaded, that, the time being given, this system will make better scholars; and the degree of scholarship being given, a much shorter time will be needed. If there is any truth in this, it will make Mr. Hamilton one of the most useful men of his age; for if there is any thing which fills reflecting men with melancholy and regret, it is the waste of mortal time, parental money, and puerile happiness, in the present method of pursuing Latin and Greek.

Hamilton, replying to this article, responds to the attacks on his system (p. 31):

  1. He defends the advertising, which was thought "unfortunate". (Read the book on this matter.)
  2. He defends the lack of competition (see the Review's point 3).
  3. He defends ascribing to one word one meaning only.
  4. He addresses the guarantee of progress.

To quote Hamilton more (p. 39):

-- READING, whose effects mankind seem to be utterly unaware of;
-- READING, the only real -- the only effectual source of instruction;
-- READING, the pure spring of nine-tenths of our intellectual enjoyments, -- the only cure for all our ignorances;
-- READING, without which no man ever yet possessed extensive information;
-- READING, which alone constitutes the difference between the blockhead and the man of learning;
-- READING, the loss of which no knowledge of Greek particles, nor the most intimate acquaintance with the rules of syntax and prosody, will ever be able to compensate;
-- READING, the most valuable gift of the Divinity, has been sacrificed to the acquirement of what never constituted real learning, and which constitutes it now less than ever; and to the contemptible vanity of being supposed a classical scholar, often without the shadow of a title to it.

(p. 54):

But there are two objections to this improvement: first, this mode will not teach him grammar! Those who make this objection cannot see the wood for trees! To analyze a phrase word for word, to translate it by corresponding parts of speech, and to point out the grammatical construction of the phrase -- the mutual dependance of all the words of a sentence on each other, is not this the very essence of grammar? Could Horace or Virgil do more? -- Ay, but the rules? Horace and Virgil knew none of these rules. But the examiners at the University do, and insist on the knowledge of them, though they do not insist on an extensive knowledge of the meaning of words.

(p. 57):

if he can continue to make his pupil wade through Grammars, Exercise Books, and Dictionaries for years, for the attainment of what I have here proved may be obtained by a far easier, more certain, more effectual, more pleasing mode, in a few months?

And much more. Also read: The New Old Way of Learning Languages, The American Scholar (2008)

Back to the Interlinear gloss

We agree with Mr. Hamilton's insights, but this project disregarded some points, including the rearranged word order. So this project's initial reference to the Hamiltonian System had to be removed. The concept of machine-generated and non-proof-read glossing would be enough to render Mr. Hamilton aghast. I did not, by any means, try to be on par with the quality of authentic Hamiltonian corpora.

Back to the Project

My first exposure to the Interlinear translation was the app Legentibus, which used Interlinear translation corpora to teach Latin.

  • If I remember correctly, one corpus was Epitome Historiæ Sacræ. Also, the website had a dedicated page for the Interlinear translation, but I can't find said page right now. If you wish to learn Latin I'll say that this app is worth $10 a month.

This method taught me that learning a language can be more than going through boring grammar exercises or paying Duolingo, Inc. for not-so-effective outcomes. After all, especially with dead languages, the goal is to read. But advancing was hindered by the limited corpora, including paywalled texts. So, as usual, this became my new problem-solving project. The sub-goal was to read Die Leiden des jungen Werther (1774), of which I bought a copy and gave up after reading a page.

  • For one's first German reading, I'd recommend Kaufmann's bilingual edition of Faust.

The LLM.

By that time LLMs, especially ChatGPT 3.5, had arisen. Despite the sensationalism, their robust natural language processing would help such a project.

Why not use traditional machine translation, such as Google Translate, or simple dictionary indexing?

It was to fit the gloss to the context, au contraire de la méthode hamiltonienne (contrary to the Hamiltonian method).

One can argue that simple dictionary indexing would be enough, with which I partly agree. (It also accords with Mr. Hamilton's view.) But the resources required for the LLM task are cheap enough that the work required to build the vocabulary data and an indexing engine may well not be less than the LLM's.

Instead of ChatGPT API calls, why not build a specialized model? It wouldn't introduce many requirements.

To be done...

API calls

First, the text format had to be set so it could be parsed by the script. Given the simple requirements, instead of JSON I used a simple line-based format:

  • Input:
	0: Je
	1: le
	2: sais.
  • Output:
	0: Je || I
	1: le || it
	2: sais. || know.

The numbers and the original text were there to keep the LLM reminded of the structure.
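For illustration, a minimal parser for that `i: original_word || gloss` output could look like the sketch below; the function name and the error handling are mine, not the repository's actual code.

```
import re

def parse_gloss_block(output_text, original_tokens):
    """Parse lines of the form `i: original || gloss` into a list of glosses."""
    glosses = [None] * len(original_tokens)
    for line in output_text.splitlines():
        line = line.strip().strip("`")
        m = re.match(r"^(\d+):\s*(.*?)\s*\|\|\s*(.*)$", line)
        if not m:
            continue  # skip fences, notes, or malformed lines
        i, gloss = int(m.group(1)), m.group(3)
        if i >= len(original_tokens):
            raise ValueError(f"Index {i} out of range")
        glosses[i] = gloss
    missing = [i for i, g in enumerate(glosses) if g is None]
    if missing:
        raise ValueError(f"Lines omitted by the model: {missing}")
    return glosses
```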

And the initial prompt was:

Parse this corpus (Interlinear gloss).

The user will tokenize and enumerate the raw input, as:
	`Je suis.`
to
```
	0: Je
	1: suis.
```

You are to respond with 
```
	i: original_word || gloss
```.
Here, the glosses are delimited with `||`.
No line should be skipped. Otherwise it will raise an error.

For example, if the gloss should be the translation to English,
the response shall be:
```
	0: Je || I
	1: suis. || am.
```

Since the output text is to be processed by another program,
the structure of the output is important.

The numbers should correspond to the original token.
No line shall be omitted!
```
	0: Je
	1: le
	2: sais.
```
```
	0: Je || I
	1: le || it
	2: sais. || know
```

The output should only consist of the gloss block (```...```) and any other notes will be ignored.

With an accompanying example. This long prompt was needed to pin down the structure, but even with it the LLM would sometimes ignore the structure.

  • I have to admit that the code wrapping the API call is a hodgepodge of ad hoc fixes. Yet such accords with the spirit of LLM applications.
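For what it's worth, here is a rough sketch of such a wrapper, assuming the openai Python client (v1+) and the parse_gloss_block() helper sketched earlier; GLOSS_PROMPT stands in for the long prompt quoted above, and none of this is the project's actual code.

```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
GLOSS_PROMPT = "Parse this corpus (Interlinear gloss). ..."  # the prompt above

def annotate_paragraph(numbered_block, model="gpt-4o-mini", max_retries=3):
    """Send one numbered token block; retry when the model ignores the structure."""
    tokens = numbered_block.splitlines()
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": GLOSS_PROMPT},
                {"role": "user", "content": numbered_block},
            ],
        )
        text = response.choices[0].message.content
        try:
            return parse_gloss_block(text, tokens)  # from the earlier sketch
        except ValueError:
            continue  # structure ignored; ask again
    raise RuntimeError("The model kept ignoring the output structure")
```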

Fine-tuning of ChatGPT

To lessen the token usage and failure rates, fine-tuning was needed. Thankfully the corpora to be ground into the machine were all in the public domain. The texts used are:

  • Aesop's Fables as Romanized by Phaedrus with Literal Interlinear Translation (1833)
  • Eduard in Schottland, oder die Nacht eines Flüchtlings (1804)
  • Selections from the German Poets, with interlinear translations (1853)
  • Cornelius Nepos, adapted to the Hamiltonian system by an interlinear and analytical translation (189?)

And on the base model gpt-3.5-turbo, and later gpt-4o-mini, a JSONL of about 1 MB was fed. (Data) By the nature of these corpora the output introduces archaisms like "thy", but that would be a bonus feature.
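For reference, each line of such a fine-tuning JSONL follows OpenAI's chat format (a system prompt, a user block, and the desired assistant output). The snippet below is an illustrative example I made up, not an excerpt of the actual training data.

```
import json

example = {
    "messages": [
        {"role": "system", "content": "Parse this corpus (Interlinear gloss). ..."},
        {"role": "user", "content": "0: Je\n1: le\n2: sais."},
        {"role": "assistant", "content": "0: Je || I\n1: le || it\n2: sais. || know."},
    ]
}

# Append one training example per line to the JSONL file.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```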

/docs/presentation/images/sysarch.png

Comparison of the models

Time (relative)   gpt-3.5-turbo   gpt-4o-mini
default           100%            99.9%
fine-tuned        87.8%           88.8%

Measured on the 24 poems of Winterreise, taking the median value.

  • gpt-4o-mini is 5 times cheaper.
  • Fine-tuned models also use fewer tokens.

The corpus preprocessing.

Read: the design doc

Now that the core was planned, I looked at it from the outside. The User (or, I) will put in the text, and it has to be preprocessed into a form optimized for the interlinear translation. This is a crucial point of the project -- by this it gains its reason-to-be, becoming more than asking gpt-4o-mini "Make an interlinear translation of this corpus". It has to be reusable. The obvious choice was JSON.

/docs/design/images/class_serializable.png

I just like drawing diagrams.

Simply:

  • A Token (that is, a word) will be annotated with a gloss.
  • Tokens form a Paragraph, which is a unit to be fed to the annotator.
  • Paragraphs form a Corpus.
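
A minimal sketch of those three serializable classes; the field names here are illustrative, and the real classes (see the diagram above) carry more metadata.

```
from dataclasses import dataclass, field, asdict
from typing import Optional
import json

@dataclass
class Token:
    txt: str
    gloss: Optional[str] = None  # filled in by the Annotator

@dataclass
class Paragraph:
    tokens: list[Token] = field(default_factory=list)

@dataclass
class Corpus:
    paragraphs: list[Paragraph] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)

# e.g. Corpus([Paragraph([Token("Je"), Token("le"), Token("sais.")])]).to_json()
```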

/docs/proposal/images/proposal_diag.png

So:

  • The User inputs the text (or a processed Corpus JSON, as in the sequence diagram).
  • The Backend transforms it into the Corpus to be manipulated.
  • The Parser:
    • Divides the Corpus into Paragraphs: on "\n" for prose, and on "\n\n" for poems.
    • Parses the Paragraph into Tokens. There are some considerations, such as languages without spacing (e.g. Japanese) and those with particles (e.g. Korean). While these can be handled fairly easily with NLP libraries, the targeted source languages (French, German, Latin) need no such considerations, so I went with the reliable re.split() on string.whitespace (see the sketch after this list).
  • The Annotator calls the API and puts the glosses into the Paragraphs...
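
A minimal sketch of that parsing step, assuming paragraphs are separated by "\n" for prose and "\n\n" for poems; the names are mine, not the repository's.

```
import re
import string

WHITESPACE_RE = re.compile(f"[{re.escape(string.whitespace)}]+")

def split_into_paragraphs(text, is_poem=False):
    sep = "\n\n" if is_poem else "\n"
    return [p for p in text.split(sep) if p.strip()]

def tokenize_paragraph(paragraph):
    # Whitespace-delimited tokens; punctuation stays attached, as in "sais.".
    return [tok for tok in WHITESPACE_RE.split(paragraph) if tok]
```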

/docs/design/images/class_manipulators.png /docs/design/images/class_req_options.png

The Backend.

Used Django. See the design doc for the endpoints.
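
Not the actual endpoints (those are in the design doc), but a hypothetical view, just to illustrate the shape of the backend; it reuses the parsing sketch above, and the route name is made up.

```
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def parse_corpus_view(request):
    """Hypothetical: accept raw text, return the split and tokenized Paragraphs as JSON."""
    body = json.loads(request.body)
    paragraphs = split_into_paragraphs(body["text"], is_poem=body.get("is_poem", False))
    return JsonResponse({"paragraphs": [tokenize_paragraph(p) for p in paragraphs]})

# urls.py (hypothetical): path("api/parse/", parse_corpus_view)
```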

/docs/design/images/er_dj_serializables.png

Deployment

Deployed on AWS, with the domain glossysnake.com (soon to be on https...)

The TODOs

  • SSL
  • No actual accounts
  • No convenient token usage tracking
  • Not parallelized; too slow to be used.
  • etc., etc., etc. It's not yet ready for production.

The frontend.

Used Vue.js 3.

Since the project's goal is to nicely wrap what the LLM API gives, the frontend development was no less important than the rest. I wanna thank GitHub Copilot for helping me write the Vue code; to make a confession, I hate what GenAI outputs, for its soul-devoidness. What is devoid of the human soul is not worth beholding. For this reason I don't take the GenAI-will-replace-the-human-art people seriously. Still, tasks like this benefit from GenAI-as-a-tool -- even in my project the text GenAI generates supports the original human text. So I could write a functioning frontend in days, transformed from the 1200-line pure Javascript code that I had made to test the backend API. LLMs do a good job at such framework chores. But I have to hate those who use machine-output code that they can't explain, when such explanation could be done trivially by the very machine.

The Applications.

Forget what's written above. Why would you care that the text is broken down into Corpus, Paragraph, and Token? Hell, how else would it be implemented? What matters is the following applications -- and that any corpus can be machine-glossed automatically likewise.

These are included on the frontend.

Winterreise

/docs/presentation/images/out_fremd.png

You can complain about the barbarism introduced to the original Text. I encourage you to learn German, by this method or any other, so that you don't need such tools. In fact I want this tool to be forgotten after it's used.

One point: I actually don't speak German much, so I can't guarantee how accurate the gloss would be. But I speak a bit of French.

Le papillon et la fleur

/docs/presentation/images/out_papillon.png

Pretty okay. Literal translations like "Not free not!", "Thou thyself goest!", "Yet we we love," are expected; after all, what is to be read is the French, not the English.

  • I'm planning an additional project centering on the Lieder and mélodies.

Remarks.

In the age of FadAI, with unfortunately much more to come, the importance of holding onto the classical corpora is only increasing. With the acquisition of the Languages (French, German, Latin and Greek) we can develop immunity against the mass- and machine-generated slop.

  • Only through good written souls can one find his soul's meaning.

Chanjin Park a.k.a. "Chamchi"

2024.09.21

A LaTeX Proof-of-Concept

/src/tools/latex/werther.png

The PDF file of Die Leiden des jungen Werther (1774)

2024.09.29

From here

The project has achieved its initial goal, and I plan to go further:

To a serverless service

The current codebase is backend-heavy and is not sustainable. I've migrated the annotator to the frontend side so it can be more versatile. I also plan to reform the frontend.

Code cleanup

The Python annotator code and the translated Javascript were not much changed from the proof-of-concept code. This inefficient and undocumented codebase has become hard to maintain. I plan to rewrite the code in a more structured way.

"Natural Language Processing"

I have to disclose, though it may be obvious, that I'm not versed in NLP. Parsing of the ChatGPT output (badly) works for now, but I plan to go further too. I'm (re-)learning NLP and hope to test a fitting approach. There appear to be many papers regarding the application of NLP to the interlinear gloss, especially for philological interests, so I'd suggest to myself to get in touch with them.

24.10.16

From here #2

After modifying the frontend code it now fits my use; but to advance:

Replace the current Corpus model

The current naive JSON model proved to be computationally heavy. I'm thinking of changing it to XML.

A more structured Frontend

The current monolithic IndexedDB approach, too, is too heavy.

& much more.

24.10.22