Representations from representation_models are generated twice when nr_topics="auto". #2144
Edit 2: I successfully ran the tests now. I will check how to do the PR, maybe Monday next week.

Edit: I managed to run the tests, and they are failing for edge cases. It seems to happen mostly when the number of found topics is already less than the nr_topics to be reduced to. Looking into it.

Hello! I will probably be coming back to this project, so I gave this a shot since it was flagged as a good first issue, and I think I solved it (at least for my purpose). I don't know how to run a comprehensive test, though. I can continue with the PR if I get some guidance on this question, as I only tested with my own data and parameters. I modified the topic extraction in fit_transform, adding a
Then and then
Another approach would have involved changing the way
It may be simpler doing the latter, but I didn't have the confidence or knowledge about the library to figure that out.
@pipa666 Thank you for considering tackling this! If I'm not mistaken, I believe the main changes that need to be made are in
That way, I think you would only need to adjust
Wouldn't that affect other places that call _extract_topics or _extract_words_per_topic? That's why I ended up adding another argument similar to
I agree that it looks messier now. I tried different approaches but never succeeded in keeping the
It was challenging because of the way the
It might indeed. I checked your code, and it surprises me that we couldn't just change the following (BERTopic/bertopic/_bertopic.py, lines 492 to 493 in 0b4265a) to something like this:

```python
# Extract topics by calculating c-TF-IDF
if self.nr_topics:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_aspects=False)
else:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
```
But in that case, it would still calculate the "main" representation (which could be an LLM prompt). As far as I understand, the relevant code is in BERTopic/bertopic/_bertopic.py, lines 4294 to 4306 in 0b4265a.
Another option, instead of adding an extra argument, is to extend the
Edit: I made it a bit cleaner in fit_transform.
Ah, my apologies, I meant using:

```python
# Extract topics by calculating c-TF-IDF
if self.nr_topics:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose, calculate_representation=False)
else:
    self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose)
```

If we approach it like the above (which is the only change needed in
I don't really understand why it is necessary to do more changes than the above.
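As a minimal, self-contained sketch of this idea (toy names throughout, not BERTopic's actual internals), the flag would let fit_transform defer the expensive representation pass until after reduction, so the representation model runs exactly once:

```python
# Toy sketch of the "skip the expensive representation on the first pass"
# pattern discussed above. The names (fit_transform, _extract_topics,
# nr_topics) mirror the discussion, but this is NOT BERTopic's real code.

class ToyTopicModel:
    def __init__(self, nr_topics=None):
        self.nr_topics = nr_topics
        self.llm_calls = 0  # counts expensive representation passes

    def _extract_topics(self, topics, calculate_representation=True):
        # c-TF-IDF keywords are always (cheaply) computed ...
        self.keywords = {t: f"keywords-{t}" for t in topics}
        if calculate_representation:
            # ... but the costly representation model (e.g. an LLM prompt)
            # only runs when explicitly requested.
            self.llm_calls += 1

    def _reduce_topics(self, topics):
        # stand-in for "auto" reduction: merge down to two topics
        return topics[:2]

    def fit_transform(self, topics):
        if self.nr_topics:
            # first pass: skip the representation, it would be discarded anyway
            self._extract_topics(topics, calculate_representation=False)
            topics = self._reduce_topics(topics)
        # final (or only) pass computes the representation exactly once
        self._extract_topics(topics)
        return topics

model = ToyTopicModel(nr_topics="auto")
model.fit_transform([0, 1, 2, 3])
assert model.llm_calls == 1  # representation computed once, not twice
```

With this shape, models fitted without nr_topics take the same single-pass path, so no other call sites need to change.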
I agree, I will try. But there is an issue when dealing with the case of
If I manage the exception in
then, when running pytest, the tests throw errors for the edge cases. Here is the error:

```
model = 'online_topic_model', documents = ['\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pe...e practical, and the experimenters were\nprobably sure that it was 5 milliseconds, not 4 or 6 either.\n\n\nSteve', ...]
```

I can keep looking into this further during the week.
I'm not sure if I understand correctly. Why are you adding that additional else statement here:

```python
if isinstance(self.nr_topics, int):
    if self.nr_topics < initial_nr_topics:
        documents = self._reduce_to_n_topics(documents, use_ctfidf)
    else:
        logger.info(f"Topic reduction - Number of topics ({self.nr_topics}) is equal to or higher than the clustered topics ({len(self.get_topics())}).")
        documents = self._sort_mappings_by_frequency(documents)
        self._extract_topics(documents, verbose=self.verbose)
    return documents
elif isinstance(self.nr_topics, str):
    documents = self._auto_reduce_topics(documents, use_ctfidf)
else:
    raise ValueError("nr_topics needs to be an int or 'auto'!")
```

It shouldn't do anything if ... (see BERTopic/bertopic/_bertopic.py, lines 4350 to 4356 in 0b4265a).
I believe there is no need to actually adapt
If we don't have that adaptation, the following happens:
Since the original code already calculated the representation at step 2), thus causing the current issue, this exception handling wasn't needed. But now that representations are not calculated before attempting to reduce, we need to add the
I hope this helps explain my logic.
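A minimal, hypothetical sketch of the guard being described (illustrative names only, not BERTopic's actual code): when the requested number of topics is not smaller than the number found, reduction is skipped, but the representation pass that was deferred from the first extraction must still run exactly once:

```python
# Hypothetical sketch: reduction may be a no-op, yet the deferred
# representation pass must still happen in every branch.

def reduce_topics(found_topics, nr_topics, extract_topics):
    """Reduce topics, then run the (deferred) representation pass once.

    extract_topics is a callback standing in for the expensive
    representation step (e.g. an LLM prompt).
    """
    if isinstance(nr_topics, int):
        if nr_topics < len(found_topics):
            # stand-in for the real merging logic
            found_topics = found_topics[:nr_topics]
        # runs whether or not reduction happened, because the
        # first extraction pass skipped the representation
        extract_topics(found_topics)
    elif nr_topics == "auto":
        found_topics = found_topics[:2]  # stand-in for auto merging
        extract_topics(found_topics)
    else:
        raise ValueError("nr_topics needs to be an int or 'auto'!")
    return found_topics

calls = []
# Edge case: more topics requested than were found; reduction is skipped,
# but the representation callback still fires once.
out = reduce_topics(["a", "b", "c"], 10, calls.append)
assert out == ["a", "b", "c"] and len(calls) == 1
```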
It does! I might have misunderstood the purpose of the
It does, but that wasn't true for the
So do you find the solution agreeable? If so, I can submit a PR in the recommended format. I am still trying to solve the problem described here though, which I find perplexing.
Yeah, assuming we simplify a bunch of the code that I'm seeing here: pipa666@4ce0b1a. I'm hesitant to add so much overhead for what is in essence a rather straightforward feature, especially if we could just do it by applying
Ok, I came back to this and solved it as
This keeps fit_transform() clean. I added some logger messages to better track the process and whether topics are being reduced or not.
I noted an issue that will appear in edge cases, which was related to my problem in getting the tests to pass (in particular for
As it is in my commit, when the
This is probably not a critical concern since it's an edge case, and topic information is still displayed sorted by frequencies; it's just that the topic indices won't match their frequency ranks.
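For illustration, here is a minimal stand-in for the frequency-sorting step being discussed (not BERTopic's actual _sort_mappings_by_frequency): it relabels topics so that the topic index matches the frequency rank, which is the invariant that breaks in the edge case above:

```python
# Toy relabeling: topic 0 becomes the most frequent topic, topic 1 the
# next most frequent, and so on. Outliers (-1) are left untouched.
from collections import Counter

def sort_topics_by_frequency(doc_topics):
    freq = Counter(t for t in doc_topics if t != -1)
    # rank topics by document count; most_common() breaks ties by
    # insertion order, which is fine for this illustration
    mapping = {old: rank for rank, (old, _) in enumerate(freq.most_common())}
    mapping[-1] = -1
    return [mapping[t] for t in doc_topics]

print(sort_topics_by_frequency([5, 5, 5, 2, 2, -1]))  # [0, 0, 0, 1, 1, -1]
```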
@pipa666 Awesome, this already looks much cleaner! I think this should be ready for a pull request. There are a couple of questions/suggestions that I have but I think we can continue that discussion in the PR.
Yeah, users wanting more topics than the clustering gives back actually happens quite often and indeed should be captured. What about calculating
I think it will be, as I have unfortunately seen many users run into this edge case.
Discussed in #2142
Originally posted by siddharthtumre September 10, 2024
Hi @MaartenGr
While using nr_topics="auto", I can see that the representations from representation_models are calculated again after reducing the number of topics. This increases the run time, as an LLM is prompted twice, i.e., before and after reducing topics.
I believe generating representations after reducing the topics should be sufficient.
That's indeed what is happening currently. I believe it should be straightforward to implement that so this might be a good first issue for those wanting to help out.
For now, what you could also do is run the model first without any representation models and then use .update_topics with your selected representation model (e.g., an LLM) to create the representations. Then the LLM would run only once and you get the same results.
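A toy illustration of why this workaround helps (the names mirror BERTopic's API shape, but this is a simplified stand-in, not the real library): attaching the representation model during fitting runs it both before and after reduction, while fitting without it and calling update_topics afterwards runs it once:

```python
# Stand-ins for an LLM-backed representation model and a topic model.
# NOT the real BERTopic; only the call pattern is mirrored.

class FakeLLM:
    def __init__(self):
        self.prompts = 0

    def represent(self, topics):
        self.prompts += 1  # one costly prompt pass per call

class ToyModel:
    def __init__(self, representation_model=None, nr_topics=None):
        self.representation_model = representation_model
        self.nr_topics = nr_topics

    def _extract_topics(self, topics):
        if self.representation_model:
            self.representation_model.represent(topics)

    def fit_transform(self, topics):
        self._extract_topics(topics)       # representation pass 1
        if self.nr_topics:
            topics = topics[:2]            # stand-in for topic reduction
            self._extract_topics(topics)   # representation pass 2
        self.topics_ = topics
        return topics

    def update_topics(self, representation_model):
        self.representation_model = representation_model
        self._extract_topics(self.topics_)

# Current behaviour: the LLM is prompted twice when topics are reduced.
llm = FakeLLM()
ToyModel(representation_model=llm, nr_topics="auto").fit_transform([0, 1, 2, 3])
assert llm.prompts == 2

# Workaround: fit without a representation model, then update once.
llm = FakeLLM()
model = ToyModel(nr_topics="auto")
model.fit_transform([0, 1, 2, 3])
model.update_topics(llm)
assert llm.prompts == 1
```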