
[Hallucinations] Repetition of words or chunks with own fine-tuned model #987

Open
asr-lord opened this issue Sep 2, 2024 · 2 comments

@asr-lord

asr-lord commented Sep 2, 2024

I've built a "real-time" application that processes 3-second chunks with my own small fine-tuned model. It reads the complete call recording, splits it into 3 s chunks, and in some cases I get repetitions of the same word(s):
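Since each 3 s chunk is decoded independently, a word cut at a chunk boundary can be emitted twice. One common mitigation is to cut overlapping windows and merge the results. A minimal sketch of the windowing step (the helper name and parameters are illustrative, assuming the recording is already a 16 kHz sample array):

```python
def overlapping_chunks(wav, sr=16000, chunk_s=3.0, overlap_s=0.5):
    """Cut `wav` into windows of `chunk_s` seconds that overlap by `overlap_s`
    seconds, so a word split at one boundary appears whole in the next chunk."""
    size = int(chunk_s * sr)                 # samples per chunk
    step = int((chunk_s - overlap_s) * sr)   # hop between chunk starts
    chunks = []
    for start in range(0, len(wav), step):
        chunks.append(wav[start:start + size])
        if start + size >= len(wav):         # last window reached the end
            break
    return chunks
```

The overlapping region then needs deduplication when stitching transcripts back together.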

I converted the model with the following code:

```python
!pip install "transformers[torch]>=4.23" ctranslate2  # quoted so the shell doesn't treat >= as redirection
from ctranslate2.converters import TransformersConverter

model_name_or_path = "/home/whisper-small-fine-tuned"
output_dir = "/home/whisper-small-ct2"

converter = TransformersConverter(model_name_or_path)
converter.convert(output_dir, quantization="float16", force=True)
```

And I run the following code to get the transcription:

```python
from faster_whisper import WhisperModel

model_size = "/home/whisper-small-ct2"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus a TranscriptionInfo
segments, info = model.transcribe(
    wav_array_chunk_16khz,
    initial_prompt="Venta telefonica",
    language="es",
    task="transcribe",
    hotwords=None,
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    chunk_length=5,
    condition_on_previous_text=False,
    suppress_tokens=[],
)
```

Output text transcription:

```python
['no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo', 'soy', 'emigante', 'y', 'hasta', 'hoy', 'no', 'porque', 'yo']
```
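As a post-processing workaround (it does not fix the decoding itself), immediate repetitions of the same word sequence can be collapsed after transcription. A minimal sketch, using a hypothetical `collapse_repeats` helper that is not part of faster-whisper:

```python
def collapse_repeats(words, max_ngram=8):
    """Remove back-to-back repetitions of any n-gram (n = max_ngram..1),
    keeping a single copy of each repeated sequence."""
    out = list(words)
    for n in range(max_ngram, 0, -1):
        collapsed = []
        i = 0
        while i < len(out):
            if i + 2 * n <= len(out) and out[i:i + n] == out[i + n:i + 2 * n]:
                collapsed.extend(out[i:i + n])          # keep one copy
                i += 2 * n
                while out[i:i + n] == collapsed[-n:]:   # absorb further copies
                    i += n
            else:
                collapsed.append(out[i])
                i += 1
        out = collapsed
    return out
```

Applied to the output above, this would reduce the looped phrase to a single occurrence, though it cannot distinguish hallucinated loops from genuinely repeated speech.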

Note also that the transcription speed factor is lower for the fine-tuned model (`whisper-small-ct2`) than for the original OpenAI models:
*I'm using a 4/16 T4-GPU AWS instance

```python
{'small':
    {'float32': 5.158, 'float16': 7.313},
 'whisper-small-ct2':
    {'float32': 2.32, 'float16': 3.383},
 'medium':
    {'float32': 2.608, 'float16': 4.966}}
```
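For reference, a speed factor like the ones above can be measured as audio duration divided by wall-clock transcription time. A small sketch (the timing wrapper and its names are illustrative, not part of faster-whisper):

```python
import time

def speed_factor(run_transcription, audio_seconds):
    """Return audio duration / wall-clock time; > 1 means faster than real time.
    `run_transcription` is any zero-argument callable that performs the full
    transcription (for faster-whisper, remember to exhaust the segments
    generator, since decoding is lazy)."""
    t0 = time.perf_counter()
    run_transcription()
    return audio_seconds / (time.perf_counter() - t0)
```

For a faster-whisper run this would be called with something like `lambda: list(model.transcribe(audio)[0])` so that the lazy generator is fully consumed inside the timed region.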
@asr-lord asr-lord changed the title [Hallucinations] Repetition of words or chunks with own fine-tuned models [Hallucinations] Repetition of words or chunks with own fine-tuned model Sep 3, 2024
@guidoveritone

I'm having essentially the same issue with another audio file; the problem seems to be the same.

@MahmoudAshraf97
Collaborator

Usage with chunk lengths under 30 s was greatly improved in #1101.
