Cryptical error msg for duplicates in entities #135

Open
acxcv opened this issue Sep 26, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@acxcv

acxcv commented Sep 26, 2022

🐛 Bug

When trying to create embeddings for a custom list of DBPedia entities using RDF2VecTransformer.fit_transform, I'm encountering the following bug in RDF2VecTransformer._update:

Part 1:
File "/r2venv/lib/python3.9/site-packages/pyrdf2vec/rdf2vec.py", line 271, in _update
attr[pos] = tmp.pop(self._pos_walks[i])
IndexError: list assignment index out of range

Because attr[pos] = tmp.pop(self._pos_walks[i]) tries to assign a value at index pos of the empty list attr, I tried changing it to attr.insert(pos, tmp.pop(self._pos_walks[i])). This populates the attr list, but then I run into another error:

Part 2:
File "/r2venv/lib/python3.9/site-packages/pyrdf2vec/rdf2vec.py", line 271, in _update
tmp.pop(self._pos_walks[i])
IndexError: pop index out of range

This happens because tmp is a list of length 24, and self._pos_walks[i] is 25. The for loop in line 271 iterates through the first elements of self._pos_walks (6 in my case, all with values lower than 24) and populates attr, but fails to continue because it reaches the nonexistent pop index self._pos_walks[i] = 25.
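The two failure modes described above can be reproduced with plain Python lists, independently of pyRDF2Vec. This is only a minimal sketch of the list semantics involved; attr and tmp here are illustrative stand-ins and do not reproduce the library's internal state.

```python
# Plain-Python illustration of the two IndexErrors; this is not pyRDF2Vec code.
attr = []
tmp = list(range(24))  # stand-in for a list of length 24

# Part 1: index assignment never grows a list, so it fails on an empty one.
try:
    attr[0] = tmp[0]
    part1_error = None
except IndexError as exc:
    part1_error = str(exc)

# list.insert does grow the list, which is why the workaround gets further.
attr.insert(0, tmp[0])

# Part 2: pop fails as soon as the index is not smaller than len(tmp).
try:
    tmp.pop(25)
    part2_error = None
except IndexError as exc:
    part2_error = str(exc)

print(part1_error)
print(part2_error)
```

Swapping the assignment for insert only moves the failure: the positions being popped still do not match the length of tmp.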

Steps to Reproduce

  1. Modify entities in fit_transform(kg, entities) in pyrdf2vec/examples/countries.py. I used a list of 31 entities as a test case.
entities = ['http://dbpedia.org/resource/U2', 'http://dbpedia.org/resource/Rock_music', 'http://dbpedia.org/resource/Poems_by_Edgar_Allan_Poe', 'http://dbpedia.org/resource/Post-punk', 'http://dbpedia.org/resource/U2', 'http://dbpedia.org/resource/Japanese_yen', 'http://dbpedia.org/resource/Rock_music', 'http://dbpedia.org/resource/Bono', 'http://dbpedia.org/resource/Revolutionary', 'http://dbpedia.org/resource/Rock_music', 'http://dbpedia.org/resource/U2', 'http://dbpedia.org/resource/Acoustic_guitar', 'http://dbpedia.org/resource/The_Edge', 'http://dbpedia.org/resource/Rhythm_and_blues', 'http://dbpedia.org/resource/Larry_Mullen_Jr.', 'http://dbpedia.org/resource/Punk_rock', 'http://dbpedia.org/resource/U2', 'http://dbpedia.org/resource/Live_Aid', 'http://dbpedia.org/resource/The_Joshua_Tree', 'http://dbpedia.org/resource/Music_of_Ireland', 'http://dbpedia.org/resource/Billboard_200', 'http://dbpedia.org/resource/Without_You_(Badfinger_song)', 'http://dbpedia.org/resource/The_Joshua_Tree', 'http://dbpedia.org/resource/U2', 'http://dbpedia.org/resource/Achtung_Baby', 'http://dbpedia.org/resource/U2', "http://dbpedia.org/resource/All_That_You_Can't_Leave_Behind", 'http://dbpedia.org/resource/Post-punk', 'http://dbpedia.org/resource/U2', 'http://dbpedia.org/resource/Punk_rock', 'http://dbpedia.org/resource/Dublin']
  2. Change line 271 in rdf2vec.py._update as described in part 1
  3. Run your modified version of rdf2vec/examples/countries.py

Environment

  • Operating system: Fedora Linux 35
  • pyRDF2Vec version: 0.2.3
  • Python version: 3.9.13
  • Random seed: 22

Thanks for looking into it!

@acxcv acxcv added the bug Something isn't working label Sep 26, 2022
@acxcv
Author

acxcv commented Sep 28, 2022

I forgot to mention that the code executes flawlessly with the original entities from countries.py, regardless of the above changes to rdf2vec.py.

However, in my example with custom entities, if I use a subset of the custom entities list, entities[:22], a different error occurs:

File "/rdf2vec/r2venv/lib/python3.9/site-packages/pyrdf2vec/embedders/word2vec.py", line 73, in transform
raise ValueError(
ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector.

To recap:

  • The original countries.py works in either case
  • The countries.py code with a custom entities list (len 31) causes the above IndexErrors. Which one occurs depends on whether attr[pos] = ... has been modified
  • The same code with a shorter entities list (entities[:22], len 22), like in countries.py, causes either
    • an IndexError, depending on whether the line from Part 1 has been modified, OR
    • the ValueError, with the changes from Part 1 and 2

Does anybody know what's going on here?

@GillesVandewiele
Collaborator

GillesVandewiele commented Sep 30, 2022

I don't have much bandwidth to look at this atm. But does padding the list with some dummy entities fix the issue?

Do your entities occur in the KG? Shouldn't it be https instead of http for instance? Maybe test if you can extract walks for a single entity?

@acxcv
Author

acxcv commented Sep 30, 2022

Hi Gilles,

Thanks for your reply.

The problem was simply that there were duplicates in the entities list.
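For anyone hitting the same error, one way to work around it is to remove duplicates before passing the list to the transformer. The sketch below is plain Python; deduplicate and index_map are hypothetical names introduced here for illustration and are not part of pyRDF2Vec.

```python
def deduplicate(entities):
    """Return the unique entities in first-seen order, plus an index map.

    index_map[i] is the position in the unique list of entities[i].
    """
    # dict.fromkeys keeps the first occurrence of each key, in order.
    unique = list(dict.fromkeys(entities))
    position = {entity: idx for idx, entity in enumerate(unique)}
    index_map = [position[entity] for entity in entities]
    return unique, index_map


entities = [
    "http://dbpedia.org/resource/U2",
    "http://dbpedia.org/resource/Rock_music",
    "http://dbpedia.org/resource/U2",
    "http://dbpedia.org/resource/Dublin",
]
unique, index_map = deduplicate(entities)
print(unique)
print(index_map)
```

The unique list is what would be passed to fit_transform. If one embedding per original entry is still needed afterwards, something like [embeddings[i] for i in index_map] should restore the original order and repetitions, assuming the embeddings come back in the same order as the entities that were passed in.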

@acxcv acxcv closed this as completed Sep 30, 2022
@acxcv acxcv changed the title from "IndexError: pop index out of range in rdf2vec.RDF2VecTransformer._update()" to "Duplicates in entities – IndexError: pop index out of range in rdf2vec.RDF2VecTransformer._update()" Sep 30, 2022
@GillesVandewiele
Collaborator

OK, thanks for the update! I will re-open the issue, however, as duplicates are something we could detect for users and raise a clearer error for!
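Until such a check exists in the library, a user-side guard along these lines could surface the problem early. This is a rough sketch only: check_unique_entities is a hypothetical helper, not an existing pyRDF2Vec function, and the error wording is made up for illustration.

```python
from collections import Counter


def check_unique_entities(entities):
    """Raise a descriptive ValueError if any entity occurs more than once."""
    counts = Counter(entities)
    duplicates = {entity: n for entity, n in counts.items() if n > 1}
    if duplicates:
        listing = ", ".join(f"{e} (x{n})" for e, n in duplicates.items())
        raise ValueError(
            f"The entities list contains duplicates: {listing}. "
            "Remove them before fitting."
        )


try:
    check_unique_entities(
        [
            "http://dbpedia.org/resource/U2",
            "http://dbpedia.org/resource/Dublin",
            "http://dbpedia.org/resource/U2",
        ]
    )
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Calling the helper right before fit_transform turns the IndexError from _update into a message that names the offending entities.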

@GillesVandewiele GillesVandewiele changed the title from "Duplicates in entities – IndexError: pop index out of range in rdf2vec.RDF2VecTransformer._update()" to "Cryptical error msg for duplicates in entities" Sep 30, 2022
@Ritten11

Hi!

Any updates on this subject? I am running into similar issues. The relevant portion of my code is as follows:

55 def fit_embedding(transformer, knowledge_graph, nodes, epochs_list, rep, sub_dir):
56    """
57
58    :param transformer: The RDF2VecTransformer used for making the embeddings
59    :param knowledge_graph: Instance of RDF2Vec.Graph that is to be embedded
60    :param nodes: Instances from which an embedding should be made. Should be a list of strings.
61    :param epochs_list: List of epochs at which the embedding should be saved
62    :param rep: The current repetition of the embedding. Sometimes multiple embeddings of the same graph are made, and
63    this is needed for saving the embedding to the right directory
64    :param sub_dir: subdirectory to which the embedding should be saved.
65    :return:
66    """
67    # loss_df = pd.DataFrame(columns=['epoch', 'loss'])
68    print('Starting fitting of word2vec embedding:')
69
70    bar = progressbar.ProgressBar(maxval=max(epochs_list), widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
71    bar.start()
72    walks = transformer.get_walks(knowledge_graph, nodes)
73    for e in range(max(epochs_list)):
74        transformer.embedder.fit(walks, False)
75        if (e+1) in epochs_list:
76            embeddings, literals = transformer.transform(knowledge_graph, nodes)
77            save_embeddings(embeddings, literals, e+1, rep, sub_dir)
78    bar.finish()
79    return 

Note that the nodes object is exactly the same for the transformer.get_walks() call and both transformer.transform() calls.

This piece of code produces the following error:

File "/create_embedding.py", line 76, in fit_embedding 
embeddings, literals = transformer.transform(knowledge_graph, nodes) 
File "/.pyenv/versions/KRW_project-3.10.4/lib/python3.10/site-packages/pyrdf2vec/rdf2vec.py", line 214, in transform 
embeddings = self.embedder.transform(entities)    
File "/.pyenv/versions/KRW_project-3.10.4/lib/python3.10/site-packages/pyrdf2vec/embedders/word2vec.py", line 73, in transform
raise ValueError(
ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector. 

The initialization of the RDF2Vec transformer is done using:

def init_transformer(seed):
    # Create our transformer, setting the embedding & walking strategy.
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=1, workers=10),
        walkers=[RandomWalker(4, 10, with_reverse=True, n_jobs=10, random_state=seed)],
        verbose=2
    )
    return transformer

At this point, I'm not sure where to look for a potential cause for this error. Note that when the RandomWalker is initialized with with_reverse=False, the script runs without throwing any errors (although I have yet to confirm that it produces meaningful embeddings).

Any suggestions are welcome!
