UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128) #2108
Comments
Is this a duplicate of #2107? To me it seems like you created two issues for the same thing. With respect to your issue, I'm not entirely sure what the problem is, considering this seems to relate to HDBSCAN, which operates on numerical values and not string-based information. Looking at the full error message, it seems like it attempts to print some sort of message relating to the folder you are working in, so it might be easiest to make the path you are working in fully alphanumeric to prevent any issues.

--> 196 msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")

The above seems to be where it goes wrong, and I believe part of its information is retrieved from the folder name:

resource_tracker.register(pool_subfolder, "folder")

As such, it seems to have issues with the name of the subfolder that you are currently using. Updating that to only include alphanumeric characters might solve your issue (although I'm not entirely sure; I'm simply reading through the error log).
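As a quick check (a minimal sketch, not part of BERTopic or joblib), you can verify whether the directories joblib is likely to embed in that resource-tracker message are ASCII-only:

```python
import os
import tempfile

def is_ascii(path: str) -> bool:
    """Return True if every character in the path can be encoded as ASCII."""
    try:
        path.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# Check the paths joblib is likely to touch: the current working
# directory and the system temporary directory.
for p in (os.getcwd(), tempfile.gettempdir()):
    print(p, "->", "ASCII" if is_ascii(p) else "contains non-ASCII characters")
```

If either path prints "contains non-ASCII characters", that would match the `encode("ascii")` failure in the traceback.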
Thank you for your timely reply. I set the file path to English letters and numbers, and I still have the above problem. I would be grateful if you could help me see what the problem is. Does the problem have anything to do with the file itself? The error message is as follows:
Are you sure the full path to the .py or notebook you work in (including the file itself) is alphanumeric? If so, then I would advise sharing this issue on the joblib and HDBSCAN repositories, since the issue does not seem to relate to BERTopic but to joblib specifically (I think). Also, note that I cannot read the error message the way you shared it. Make sure to properly format it and make it more readable if you intend to share it in the above repositories.
Hello, I seem to have found a bit of a clue. I put 9,100 documents from part 1 (segmented text + word vectors) into the BERTopic model and it works. Similarly, I put 12,106 documents from part 2 (segmented text + word vectors) into the BERTopic model and it works. However, when I combine the corpora of part 1 and part 2 into 21,206 documents and put them (segmented text + word vectors) into the BERTopic model, it reports an error. Why is this? I am looking forward to a reply!
The error message is still: UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)
As mentioned, I believe this issue relates to trying to access temporary folders that were created, which might only be the case if the data is sufficiently large. Either way, and based on the original error log, it does not seem to relate to BERTopic but rather its underlying algorithm HDBSCAN, which makes use of joblib. I would advise also posting the issue on the respective repositories, since I do not think there is much I can do from the BERTopic side of the code, unfortunately. If you want to be sure what the source of the issue is, you can change the following in joblib:

msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")

to

msg = f"{cmd}:{name}:{rtype}\n".encode("utf-8")
print(msg)

But again, even if we know the output of that, I'm not sure whether it can be solved from within BERTopic.
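Another option, rather than editing joblib's source, is to redirect joblib's temporary memmapping folder to a path that is guaranteed ASCII-only. `JOBLIB_TEMP_FOLDER` is a documented joblib environment variable; whether it avoids this particular crash is an assumption, and `C:/joblib_tmp` is only a placeholder path:

```python
import os

# Hypothetical workaround: point joblib's memmapping folder at an
# ASCII-only directory *before* any joblib-backed fit is started.
# "C:/joblib_tmp" is a placeholder; any fully ASCII path will do.
ascii_tmp = "C:/joblib_tmp"
os.makedirs(ascii_tmp, exist_ok=True)
os.environ["JOBLIB_TEMP_FOLDER"] = ascii_tmp

# Then fit as before, e.g.:
# topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```

This only changes where joblib creates its temporary folders, so it would help only if the non-ASCII characters come from the default temp location rather than from elsewhere.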
Have you searched existing issues? 🔎
Describe the bug
Hello author, thank you for sharing this. I am having some problems with the code you provided and would like to ask you about it. There are two tasks: task A succeeds but task B fails. The error message is: UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128). Looking forward to your reply! (Note: task A has 9,995 texts, task B has more than 36,000 texts.)
The error is reported as follows:
{
"name": "UnicodeEncodeError",
"message": "'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)",
"stack": "---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
Cell In[22], line 2
1 # View the topics
----> 2 topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass in the trained word vectors
3 topic_model.get_topic_info()
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\bertopic\_bertopic.py:389, in BERTopic.fit_transform(self, documents, embeddings, images, y)
386 umap_embeddings = self._reduce_dimensionality(embeddings, y)
388 # Cluster reduced embeddings
--> 389 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
391 # Sort and Map Topic IDs by their frequency
392 if not self.nr_topics:
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\bertopic\_bertopic.py:3218, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3216 else:
3217 try:
-> 3218 self.hdbscan_model.fit(umap_embeddings, y=y)
3219 except TypeError:
3220 self.hdbscan_model.fit(umap_embeddings)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
1195 kwargs.pop("prediction_data", None)
1196 kwargs.update(self.metric_kwargs)
1198 (
1199 self.labels_,
1200 self.probabilities_,
1201 self.cluster_persistence_,
1202 self._condensed_tree,
1203 self._single_linkage_tree,
1204 self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
1207 if self.metric != "precomputed" and not self._all_finite:
1208 # remap indices to align with original data in the case of non-finite entries.
1209 self._condensed_tree = remap_condensed_tree(
1210 self._condensed_tree, internal_to_raw, outliers
1211 )
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:837, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
824 (single_linkage_tree, result_min_span_tree) = memory.cache(
825 _hdbscan_prims_kdtree
826 )(
(...)
834 **kwargs
835 )
836 else:
--> 837 (single_linkage_tree, result_min_span_tree) = memory.cache(
838 _hdbscan_boruvka_kdtree
839 )(
840 X,
841 min_samples,
842 alpha,
843 metric,
844 p,
845 leaf_size,
846 approx_min_span_tree,
847 gen_min_span_tree,
848 core_dist_n_jobs,
849 **kwargs
850 )
851 else: # Metric is a valid BallTree metric
852 # TO DO: Need heuristic to decide when to go to boruvka;
853 # still debugging for now
854 if X.shape[1] > 60:
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
311 def __call__(self, *args, **kwargs):
--> 312 return self.func(*args, **kwargs)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:340, in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
337 X = X.astype(np.float64)
339 tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
--> 340 alg = KDTreeBoruvkaAlgorithm(
341 tree,
342 min_samples,
343 metric=metric,
344 leaf_size=leaf_size // 3,
345 approx_min_span_tree=approx_min_span_tree,
346 n_jobs=core_dist_n_jobs,
347 **kwargs
348 )
349 min_spanning_tree = alg.spanning_tree()
350 # Sort edges of the min_spanning_tree by weight
File hdbscan\_hdbscan_boruvka.pyx:392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
File hdbscan\_hdbscan_boruvka.pyx:426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\parallel.py:1909, in Parallel.__call__(self, iterable)
1906 self._start_time = time.time()
1908 if not self._managed_backend:
-> 1909 n_jobs = self._initialize_backend()
1910 else:
1911 n_jobs = self._effective_n_jobs()
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\parallel.py:1359, in Parallel._initialize_backend(self)
1357 """Build a process or thread pool and return the number of workers"""
1358 try:
-> 1359 n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
1360 **self._backend_args)
1361 if self.timeout is not None and not self._backend.supports_timeout:
1362 warnings.warn(
1363 'The backend class {!r} does not support timeout. '
1364 "You have set 'timeout={}' in Parallel but "
1365 "the 'timeout' parameter will not be used.".format(
1366 self._backend.__class__.__name__,
1367 self.timeout))
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_parallel_backends.py:538, in LokyBackend.configure(self, n_jobs, parallel, prefer, require, idle_worker_timeout, **memmappingexecutor_args)
534 if n_jobs == 1:
535 raise FallbackToBackend(
536 SequentialBackend(nesting_level=self.nesting_level))
--> 538 self._workers = get_memmapping_executor(
539 n_jobs, timeout=idle_worker_timeout,
540 env=self._prepare_worker_env(n_jobs=n_jobs),
541 context_id=parallel._id, **memmappingexecutor_args)
542 self.parallel = parallel
543 return n_jobs
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\executor.py:20, in get_memmapping_executor(n_jobs, **kwargs)
19 def get_memmapping_executor(n_jobs, **kwargs):
---> 20 return MemmappingExecutor.get_memmapping_executor(n_jobs, **kwargs)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\executor.py:42, in MemmappingExecutor.get_memmapping_executor(cls, n_jobs, timeout, initializer, initargs, env, temp_folder, context_id, **backend_args)
39 reuse = _executor_args is None or _executor_args == executor_args
40 _executor_args = executor_args
---> 42 manager = TemporaryResourcesManager(temp_folder)
44 # reducers access the temporary folder in which to store temporary
45 # pickles through a call to manager.resolve_temp_folder_name. resolving
46 # the folder name dynamically is useful to use different folders across
47 # calls of a same reusable executor
48 job_reducers, result_reducers = get_memmapping_reducers(
49 unlink_on_gc_collect=True,
50 temp_folder_resolver=manager.resolve_temp_folder_name,
51 **backend_args)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:540, in TemporaryResourcesManager.__init__(self, temp_folder_root, context_id)
534 if context_id is None:
535 # It would be safer to not assign a default context id (less silent
536 # bugs), but doing this while maintaining backward compatibility
537 # with the previous, context-unaware version get_memmaping_executor
538 # exposes too many low-level details.
539 context_id = uuid4().hex
--> 540 self.set_current_context(context_id)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:544, in TemporaryResourcesManager.set_current_context(self, context_id)
542 def set_current_context(self, context_id):
543 self._current_context_id = context_id
--> 544 self.register_new_context(context_id)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:569, in TemporaryResourcesManager.register_new_context(self, context_id)
562 new_folder_name = (
563 "joblib_memmapping_folder_{}_{}_{}".format(
564 os.getpid(), self._id, context_id)
565 )
566 new_folder_path, _ = _get_temp_dir(
567 new_folder_name, self._temp_folder_root
568 )
--> 569 self.register_folder_finalizer(new_folder_path, context_id)
570 self._cached_temp_folders[context_id] = new_folder_path
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:585, in TemporaryResourcesManager.register_folder_finalizer(self, pool_subfolder, context_id)
578 def register_folder_finalizer(self, pool_subfolder, context_id):
579 # Register the garbage collector at program exit in case caller forgets
580 # to call terminate explicitly: note we do not pass any reference to
581 # ensure that this callback won't prevent garbage collection of
582 # parallel instance and related file handler resources such as POSIX
583 # semaphores and pipes
584 pool_module_name = whichmodule(delete_folder, 'delete_folder')
--> 585 resource_tracker.register(pool_subfolder, "folder")
587 def _cleanup():
588 # In some cases the Python runtime seems to set delete_folder to
589 # None just before exiting when accessing the delete_folder
(...)
594 # because joblib should only use relative imports to allow
595 # easy vendoring.
596 delete_folder = __import__(
597 pool_module_name, fromlist=['delete_folder']
598 ).delete_folder
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\externals\loky\backend\resource_tracker.py:179, in ResourceTracker.register(self, name, rtype)
177 """Register a named resource, and increment its refcount."""
178 self.ensure_running()
--> 179 self._send("REGISTER", name, rtype)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\externals\loky\backend\resource_tracker.py:196, in ResourceTracker._send(self, cmd, name, rtype)
192 if len(name) > 512:
193 # posix guarantees that writes to a pipe of less than PIPE_BUF
194 # bytes are atomic, and that PIPE_BUF >= 512
195 raise ValueError("name too long")
--> 196 msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")
197 nbytes = os.write(self._fd, msg)
198 assert nbytes == len(msg)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)"
}
Reproduction