Workshops on natural language processing
Updated Jan 6, 2021 - Jupyter Notebook
Escape unknown symbols in SentencePiece vocabularies
A framework for building a SentencePiece tokenizer from a dataset
Bengali SentencePiece model created with wiki dump data.
SentencePiece model parser generated from the SentencePiece protobuf definition.
Temp fork to provide Python 3.13 macOS wheels ahead of official project releases
dataset, train, inference
Use SentencePiece in Swift for tokenization and detokenization.
Unsupervised text tokenizer for Neural Network-based text generation.
An industry-standard tokenizer designed for large-scale language models such as OpenAI's GPT series.
Pretrained models and training code for SentencePiece
Free and open source pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Yiddish, Swahili, and Yoruba.
NMT with RNN models: (1) in vanilla style, (2) with SentencePiece, (3) using pre-trained models from FairSeq
Fast and versatile tokenizer for language models with BPE, Unigram and WordPiece tokenization. Compatible with SentencePiece, Tokenizers, Tiktoken and more.
A Hugging Face Space for Sugoi V4
A study project on a natural language processing model that translates Korean into English.
Bengali language Tokenizer (SentencePiece)