UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128) #2108
Comments
Is this a duplicate of #2107? To me it seems like you created two issues for the same thing. With respect to your issue, I'm not entirely sure what the problem is, considering this seems to relate to HDBSCAN, which operates on numerical values and not string-based information. Looking at the full error message, it seems like it attempts to print some sort of message relating to the folder you are working in, so it might be easiest to make the path you are working in fully alphanumeric to prevent any issues.

--> 196 msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")

The above seems to be where it goes wrong, and I believe part of its information is retrieved from the folder name:

resource_tracker.register(pool_subfolder, "folder")

As such, it seems to have issues with the name of the subfolder that you are currently using. Updating that to only include alphanumeric characters might solve your issue (although I'm not entirely sure; I'm simply reading through the error log).
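As a quick check (a minimal sketch, not part of BERTopic or joblib), you can verify whether the directories joblib is likely to embed in that resource-tracker message are ASCII-only:

```python
import os
import tempfile

def is_ascii(path: str) -> bool:
    """Return True if every character in the path can be encoded as ASCII."""
    try:
        path.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# Check the paths joblib is likely to touch: the current working
# directory and the system temporary directory.
for p in (os.getcwd(), tempfile.gettempdir()):
    print(p, "->", "ASCII" if is_ascii(p) else "contains non-ASCII characters")
```

If either path prints "contains non-ASCII characters", that would match the `encode("ascii")` failure in the traceback.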
Thank you for your timely reply. I set the file path to English letters and numbers, and I still have the above problem. I would be grateful if you could help me see what the problem is. Does the problem have anything to do with the file itself? The error message is as follows:
Are you sure the full path to the .py or notebook you work in (including the file itself) is alphanumeric? If so, then I would advise sharing this issue on the joblib and HDBSCAN repositories, since the issue does not seem to relate to BERTopic but to joblib specifically (I think). Also, note that I cannot read the error message the way you shared it. Make sure to properly format it and make it more readable if you intend to share it in the above repositories.
Hello, I seem to have found a bit of a clue. I put 9,100 documents from part 1 (segmented text + word vectors) into the BERTopic model and it works. Similarly, I put 12,106 documents from part 2 (segmented text + word vectors) into the BERTopic model and it works. However, when I combine the corpora of part 1 and part 2 into 21,206 documents and put them (segmented text + word vectors) into the BERTopic model, it reports an error. Why is this? I am looking forward to a reply!
The error message is still: UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)
As mentioned, I believe this issue relates to trying to access temporary folders that were created, which might only be the case if the data is sufficiently large. Either way, and based on the original error log, it does not seem to relate to BERTopic but rather its underlying algorithm HDBSCAN, which makes use of joblib. I would advise also posting the issue on the respective repositories, since I do not think there is much I can do from the BERTopic side of the code, unfortunately. If you want to be sure what the source of the issue is, you can change the following in joblib:

msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")

to

msg = f"{cmd}:{name}:{rtype}\n".encode("utf-8")
print(msg)

But again, even if we know the output of that, I'm not sure whether it can be solved from within BERTopic.
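Another option, rather than editing joblib's source, is to redirect joblib's temporary memmapping folder to a path that is guaranteed ASCII-only. `JOBLIB_TEMP_FOLDER` is a documented joblib environment variable; whether it avoids this particular crash is an assumption, and `C:/joblib_tmp` is only a placeholder path:

```python
import os

# Hypothetical workaround: point joblib's memmapping folder at an
# ASCII-only directory *before* any joblib-backed fit is started.
# "C:/joblib_tmp" is a placeholder; any fully ASCII path will do.
ascii_tmp = "C:/joblib_tmp"
os.makedirs(ascii_tmp, exist_ok=True)
os.environ["JOBLIB_TEMP_FOLDER"] = ascii_tmp

# Then fit as before, e.g.:
# topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```

This only changes where joblib creates its temporary folders, so it would help only if the non-ASCII characters come from the default temp location rather than from elsewhere.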
Have you searched existing issues? 🔎
Describe the bug
Hello author, thank you for sharing this. I am having some problems with the code you provided and would like to ask you about it. There are two tasks: task A succeeds but task B fails. The error message is: UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128). Looking forward to your reply! (Note: task A has 9,995 texts, task B has more than 36,000 texts.)
The error is reported as follows:
{
"name": "UnicodeEncodeError",
"message": "'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)",
"stack": "---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
Cell In[22], line 2
1 # View the topics
----> 2 topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)  # pass in the trained word vectors
3 topic_model.get_topic_info()
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\bertopic\_bertopic.py:389, in BERTopic.fit_transform(self, documents, embeddings, images, y)
386 umap_embeddings = self._reduce_dimensionality(embeddings, y)
388 # Cluster reduced embeddings
--> 389 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)
391 # Sort and Map Topic IDs by their frequency
392 if not self.nr_topics:
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\bertopic\_bertopic.py:3218, in BERTopic._cluster_embeddings(self, umap_embeddings, documents, partial_fit, y)
3216 else:
3217 try:
-> 3218 self.hdbscan_model.fit(umap_embeddings, y=y)
3219 except TypeError:
3220 self.hdbscan_model.fit(umap_embeddings)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:1205, in HDBSCAN.fit(self, X, y)
1195 kwargs.pop("prediction_data", None)
1196 kwargs.update(self.metric_kwargs)
1198 (
1199 self.labels_,
1200 self.probabilities_,
1201 self.cluster_persistence_,
1202 self._condensed_tree,
1203 self._single_linkage_tree,
1204 self._min_spanning_tree,
-> 1205 ) = hdbscan(clean_data, **kwargs)
1207 if self.metric != "precomputed" and not self._all_finite:
1208 # remap indices to align with original data in the case of non-finite entries.
1209 self._condensed_tree = remap_condensed_tree(
1210 self._condensed_tree, internal_to_raw, outliers
1211 )
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:837, in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, max_cluster_size, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
824 (single_linkage_tree, result_min_span_tree) = memory.cache(
825 _hdbscan_prims_kdtree
826 )(
(...)
834 **kwargs
835 )
836 else:
--> 837 (single_linkage_tree, result_min_span_tree) = memory.cache(
838 _hdbscan_boruvka_kdtree
839 )(
840 X,
841 min_samples,
842 alpha,
843 metric,
844 p,
845 leaf_size,
846 approx_min_span_tree,
847 gen_min_span_tree,
848 core_dist_n_jobs,
849 **kwargs
850 )
851 else: # Metric is a valid BallTree metric
852 # TO DO: Need heuristic to decide when to go to boruvka;
853 # still debugging for now
854 if X.shape[1] > 60:
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\memory.py:312, in NotMemorizedFunc.__call__(self, *args, **kwargs)
311 def __call__(self, *args, **kwargs):
--> 312 return self.func(*args, **kwargs)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\hdbscan\hdbscan_.py:340, in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
337 X = X.astype(np.float64)
339 tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
--> 340 alg = KDTreeBoruvkaAlgorithm(
341 tree,
342 min_samples,
343 metric=metric,
344 leaf_size=leaf_size // 3,
345 approx_min_span_tree=approx_min_span_tree,
346 n_jobs=core_dist_n_jobs,
347 **kwargs
348 )
349 min_spanning_tree = alg.spanning_tree()
350 # Sort edges of the min_spanning_tree by weight
File hdbscan\_hdbscan_boruvka.pyx:392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
File hdbscan\_hdbscan_boruvka.pyx:426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\parallel.py:1909, in Parallel.__call__(self, iterable)
1906 self._start_time = time.time()
1908 if not self._managed_backend:
-> 1909 n_jobs = self._initialize_backend()
1910 else:
1911 n_jobs = self._effective_n_jobs()
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\parallel.py:1359, in Parallel._initialize_backend(self)
1357 """Build a process or thread pool and return the number of workers"""
1358 try:
-> 1359 n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
1360 **self._backend_args)
1361 if self.timeout is not None and not self._backend.supports_timeout:
1362 warnings.warn(
1363 'The backend class {!r} does not support timeout. '
1364 "You have set 'timeout={}' in Parallel but "
1365 "the 'timeout' parameter will not be used.".format(
1366 self._backend.__class__.__name__,
1367 self.timeout))
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_parallel_backends.py:538, in LokyBackend.configure(self, n_jobs, parallel, prefer, require, idle_worker_timeout, **memmappingexecutor_args)
534 if n_jobs == 1:
535 raise FallbackToBackend(
536 SequentialBackend(nesting_level=self.nesting_level))
--> 538 self._workers = get_memmapping_executor(
539 n_jobs, timeout=idle_worker_timeout,
540 env=self._prepare_worker_env(n_jobs=n_jobs),
541 context_id=parallel._id, **memmappingexecutor_args)
542 self.parallel = parallel
543 return n_jobs
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\executor.py:20, in get_memmapping_executor(n_jobs, **kwargs)
19 def get_memmapping_executor(n_jobs, **kwargs):
---> 20 return MemmappingExecutor.get_memmapping_executor(n_jobs, **kwargs)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\executor.py:42, in MemmappingExecutor.get_memmapping_executor(cls, n_jobs, timeout, initializer, initargs, env, temp_folder, context_id, **backend_args)
39 reuse = _executor_args is None or _executor_args == executor_args
40 _executor_args = executor_args
---> 42 manager = TemporaryResourcesManager(temp_folder)
44 # reducers access the temporary folder in which to store temporary
45 # pickles through a call to manager.resolve_temp_folder_name. resolving
46 # the folder name dynamically is useful to use different folders across
47 # calls of a same reusable executor
48 job_reducers, result_reducers = get_memmapping_reducers(
49 unlink_on_gc_collect=True,
50 temp_folder_resolver=manager.resolve_temp_folder_name,
51 **backend_args)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:540, in TemporaryResourcesManager.__init__(self, temp_folder_root, context_id)
534 if context_id is None:
535 # It would be safer to not assign a default context id (less silent
536 # bugs), but doing this while maintaining backward compatibility
537 # with the previous, context-unaware version get_memmaping_executor
538 # exposes too many low-level details.
539 context_id = uuid4().hex
--> 540 self.set_current_context(context_id)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:544, in TemporaryResourcesManager.set_current_context(self, context_id)
542 def set_current_context(self, context_id):
543 self._current_context_id = context_id
--> 544 self.register_new_context(context_id)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:569, in TemporaryResourcesManager.register_new_context(self, context_id)
562 new_folder_name = (
563 "joblib_memmapping_folder_{}_{}_{}".format(
564 os.getpid(), self._id, context_id)
565 )
566 new_folder_path, _ = _get_temp_dir(
567 new_folder_name, self._temp_folder_root
568 )
--> 569 self.register_folder_finalizer(new_folder_path, context_id)
570 self._cached_temp_folders[context_id] = new_folder_path
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\_memmapping_reducer.py:585, in TemporaryResourcesManager.register_folder_finalizer(self, pool_subfolder, context_id)
578 def register_folder_finalizer(self, pool_subfolder, context_id):
579 # Register the garbage collector at program exit in case caller forgets
580 # to call terminate explicitly: note we do not pass any reference to
581 # ensure that this callback won't prevent garbage collection of
582 # parallel instance and related file handler resources such as POSIX
583 # semaphores and pipes
584 pool_module_name = whichmodule(delete_folder, 'delete_folder')
--> 585 resource_tracker.register(pool_subfolder, "folder")
587 def _cleanup():
588 # In some cases the Python runtime seems to set delete_folder to
589 # None just before exiting when accessing the delete_folder
(...)
594 # because joblib should only use relative imports to allow
595 # easy vendoring.
596 delete_folder = __import__(
597 pool_module_name, fromlist=['delete_folder']
598 ).delete_folder
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\externals\loky\backend\resource_tracker.py:179, in ResourceTracker.register(self, name, rtype)
177 """Register a named resource, and increment its refcount."""
178 self.ensure_running()
--> 179 self._send("REGISTER", name, rtype)
File d:\zhuhongchang\anaconda\envs\bertopic_evo\lib\site-packages\joblib\externals\loky\backend\resource_tracker.py:196, in ResourceTracker._send(self, cmd, name, rtype)
192 if len(name) > 512:
193 # posix guarantees that writes to a pipe of less than PIPE_BUF
194 # bytes are atomic, and that PIPE_BUF >= 512
195 raise ValueError("name too long")
--> 196 msg = f"{cmd}:{name}:{rtype}\n".encode("ascii")
197 nbytes = os.write(self._fd, msg)
198 assert nbytes == len(msg)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)"
}
Reproduction