Would you be open to a lightweight vectorizer/embedder? #768
Yep, you can pass custom embeddings to
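For context, passing pre-computed embeddings to BERTopic looks roughly like this; the sentence-transformers model name and the `docs` variable are placeholders for illustration, not part of the original reply:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = ["first document", "second document", "..."]  # placeholder documents

# Compute embeddings with any model you like, then hand them to BERTopic
# so it skips its own embedding step.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs)

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```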
Just to be sure, the bloom hack that you refer to is this one, right?
Definitely! I think this would appeal to those wanting a faster, CPU-based approach. Moreover, seeing as BERTopic is opting for as much modularity as possible, it makes sense to also provide more options focusing on speed whilst still providing good enough results.
Yes, I'm imagining something like this, if it makes sense to generalize it to any scikit-learn pipeline:

```python
from bertopic import BERTopic
from bertopic.backend import BaseEmbedder
from sklearn.utils.validation import check_is_fitted, NotFittedError


class SklearnEmbedder(BaseEmbedder):
    """Wrap a scikit-learn pipeline so BERTopic can use it as an embedding back-end."""

    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        # Fit the pipeline the first time it is used; afterwards only transform.
        try:
            check_is_fitted(self.pipe)
            embeddings = self.pipe.transform(documents)
        except NotFittedError:
            embeddings = self.pipe.fit_transform(documents)
        return embeddings


# `pipe` is any scikit-learn pipeline that maps raw documents to vectors
# (see the sketch below for one possible definition).
custom_embedder = SklearnEmbedder(pipe)
topic_model = BERTopic(embedding_model=custom_embedder)
```

I just tried the above on the 20 NewsGroups dataset and, subjectively evaluating the output, I am quite impressed with how similar the topics seem compared to something like …

There is one thing to note, though. BERTopic was initially built around pre-trained language models, meaning that it was assumed that those models would only need a single method (…).

Those are definitely not breaking issues, and most features will run without any problems, but it is something to take into account when opting for this back-end.
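For completeness, here is a minimal sketch of what the `pipe` above could look like; the TfidfVectorizer/TruncatedSVD combination and the 100 components are illustrative assumptions, not something specified in the thread:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# A CPU-only example pipeline: sparse tf-idf features reduced to a dense
# 100-dimensional document embedding. Components and sizes are assumptions.
pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=100),
)
```

Any other transformer that turns raw documents into a 2-D array would slot into `SklearnEmbedder` the same way.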
If you remove the … The baby has a big bad diaper now, but I'll come back to this in due time.
How about this: to keep things simple, I could make a PR for this feature with a big segment in the docs that explains some of the caveats?
Sure, that would be great!
The datasets that I have tend to be ~80K examples, and just running the embeddings on a CPU takes ~40 minutes.
I have, however, a trick up my sleeve.
This pipeline combines a hashing trick with a bloom hack, a sparse PCA trick, and a tf-idf trick. One benefit is that it is orders of magnitude faster to embed, even when you include training. So maybe it'd be nice to have these embeddings around?
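As a rough sketch of the kind of pipeline being described (the specific components, bucket sizes, and the use of TruncatedSVD as a stand-in for the sparse PCA step are assumptions, not the author's exact setup):

```python
from sklearn.pipeline import make_pipeline, make_union
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# "Bloom hack": several hashing vectorizers with different bucket counts, so a
# token that collides in one hash space is unlikely to collide in all of them.
bloom_hashers = make_union(
    HashingVectorizer(n_features=4096, alternate_sign=False),
    HashingVectorizer(n_features=4099, alternate_sign=False),
    HashingVectorizer(n_features=4111, alternate_sign=False),
)

# Hashed bloom features, reweighted with tf-idf, then reduced to a dense
# low-dimensional embedding. TruncatedSVD stands in for the sparse PCA step
# because it handles sparse input directly.
pipe = make_pipeline(
    bloom_hashers,
    TfidfTransformer(),
    TruncatedSVD(n_components=100),
)

embeddings = pipe.fit_transform(docs)  # `docs` is a list of raw text documents
```

Something along these lines would also fit directly into the `SklearnEmbedder` shown earlier in the thread.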
But what about the quality of the embeddings?
Mileage can vary, sure, but I have some results here that suggest it's certainly not the worst idea either. When you compare the UMAP chart on top of tf-idf with the Universal Sentence Encoder one then, sure ... the USE variant is intuitively better, but given the speedup, I might argue that the tf-idf approach is reasonable too.
There's a fair bit of tuning involved, and I'm contemplating a library that implements bloom vectorizers properly for scikit-learn. But once that is done, and once I've done some benchmarking, would this library be receptive to such an embedder?