collect more dataset to retrain model #5

Open
pocession opened this issue Apr 16, 2024 · 0 comments

@pocession (Owner)

Code for labelling a new dataset:

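import string

def label_text(raw_text, entity):
    # Split the text and entity into words, stripping surrounding punctuation
    # so that a token like "hypertension." still matches "hypertension"
    words = [w.strip(string.punctuation) for w in raw_text.split()]
    entity_words = [w.strip(string.punctuation) for w in entity.split()]

    # Initialize labels with "O" for each word in the text
    labels = ['O'] * len(words)
    # An empty entity would match at every position, so return early
    if not entity_words:
        return labels

    # Scan every position where the whole entity could still fit
    n = len(entity_words)
    for i in range(len(words) - n + 1):
        # Check if the current slice of words matches the entity words
        if words[i:i+n] == entity_words:
            # Label the first word of the entity with '1' (the "B" tag)
            labels[i] = '1'
            # Label the rest of the entity with '2' (the "I" tag)
            for j in range(1, n):
                labels[i+j] = '2'

    return labels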

# Example usage
raw_text = "Sildenafil is also used in both men and women to treat the symptoms of pulmonary arterial hypertension. This is a type of high blood pressure that occurs between the heart and the lungs."
entity = "pulmonary arterial hypertension"

# Get labels for the example
labels = label_text(raw_text, entity)
print("Words:", raw_text.split())
print("Labels:", labels)
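
When collecting more data, a single sentence will often contain several entities. The following is a minimal sketch of one way the same idea could be extended to handle that; it is not part of the original snippet. The function name `label_entities`, the case-insensitive matching, the rule that an earlier entity wins when two spans overlap, and the tab-separated output are all assumptions made for illustration, and the sentence is a shortened version of the example above.

```python
import string


def label_entities(raw_text, entities):
    """Tag every occurrence of each entity: '1' on its first word, '2' on the rest."""
    tokens = raw_text.split()
    # Assumption: compare on lowercased, punctuation-stripped forms,
    # but keep the original tokens for the output
    norm = [t.strip(string.punctuation).lower() for t in tokens]
    labels = ["O"] * len(tokens)

    for entity in entities:
        target = [w.strip(string.punctuation).lower() for w in entity.split()]
        n = len(target)
        if n == 0:
            continue
        for i in range(len(norm) - n + 1):
            # Skip spans that do not match or that overlap an already-labelled entity
            if norm[i:i + n] != target or any(l != "O" for l in labels[i:i + n]):
                continue
            labels[i] = "1"
            labels[i + 1:i + n] = ["2"] * (n - 1)

    return list(zip(tokens, labels))


pairs = label_entities(
    "Sildenafil is used to treat pulmonary arterial hypertension.",
    ["pulmonary arterial hypertension", "sildenafil"],
)
# One token per line with its label, separated by a tab
for token, label in pairs:
    print(f"{token}\t{label}")
```

Returning (token, label) pairs keeps the original tokens, including punctuation, aligned with their labels, which makes it easier to spot-check new examples before retraining.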
