perception for companion AI #1782
Abdulrahman392011 started this conversation in Ideas
Replies: 0 comments
I'm trying to build a necklace that can work off the grid and be actually useful.
I decided to build the software first, keeping power efficiency in mind so it can run on battery.
The idea is to have a large library that the model does RAG on... but the model has to be able to listen along without me needing to speak to it directly. It automatically finds relevant information and presents it to me when I'm free.
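A minimal sketch of that passive loop, assuming a toy token-overlap retriever (the function names, the library contents, and the scoring are all made up for illustration; a real build would use embedding search over the RAG library):

```python
# Toy sketch of the "listen along, retrieve when free" idea.
# Uses plain token overlap as the relevance score; a real system
# would embed the transcript window and the library instead.

def tokenize(text):
    return set(text.lower().split())

def retrieve(transcript_window, library, top_k=1):
    """Score each document against the recent transcript and
    return the best matches to surface when the user is free."""
    query = tokenize(transcript_window)
    scored = []
    for title, doc in library.items():
        overlap = len(query & tokenize(doc))
        if overlap:
            scored.append((overlap, title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]

library = {
    "soldering guide": "how to solder battery leads safely",
    "hiking checklist": "water map compass battery pack",
}
print(retrieve("we argued about the best battery pack for hiking", library))
# → ['hiking checklist']
```

The point of the sketch is only the shape of the loop: score a rolling transcript window against the library, and hold the results until the wearer is idle.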
The first part is done; however, now I'm at the second part, which is making it perceptive enough to follow along and understand what is going on without me directly telling it.
The initial thought was to use whisper and llava to provide a transcription and a description. But I soon found out that this isn't enough for the model to follow along: it gets confused because it doesn't understand who is saying what, or what significance that person has to the user.
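For context, the naive version of that pipeline just interleaves the two streams by timestamp into one log for the LLM. The segment and caption structures below are assumptions about what the two models emit, not their exact APIs:

```python
# Sketch: merge a Whisper-style transcript (segments with start
# times) and LLaVA-style frame captions into one chronological
# context log. The dict shapes are illustrative assumptions.

def build_context(transcript_segments, frame_captions):
    events = []
    for seg in transcript_segments:
        events.append((seg["start"], f'[heard] {seg["text"]}'))
    for cap in frame_captions:
        events.append((cap["time"], f'[seen] {cap["text"]}'))
    events.sort(key=lambda e: e[0])
    return "\n".join(line for _, line in events)

segments = [{"start": 0.0, "text": "hand me the screwdriver"},
            {"start": 4.2, "text": "this screw is stripped"}]
captions = [{"time": 2.0, "text": "a person holding a circuit board"}]
print(build_context(segments, captions))
```

Even merged like this, every `[heard]` line is anonymous, which is exactly why the model gets confused about who is saying what.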
I then tried to do some training so the model could detect who is who by what they look and sound like. However, I needed to postpone the training to the end of the day, when the user removes the necklace, puts it on the charger, and goes to sleep for eight hours or so. This is necessary because training on battery isn't feasible with the power efficiency of today's hardware.
Still, I soon found that a person may meet multiple people in a day, and the data curated at the end of the day is unstructured, which is problematic for the training process. For example, you may meet 4 people in a day: 4 new voiceprints and 4 new faces and bodies, but there is no real way to tell which of those voiceprints belongs to which of the 4 people. The same goes for appearances.
After all that (if you read this far): pyannote can help identify which voiceprint belongs to which speaker and structure the audio data. The faces and bodies are still somewhat unstructured, but the structured voiceprint data may help with structuring the appearances.
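One way the voiceprint structure could help structure the appearances is temporal co-occurrence: assign each face track to the diarized speaker whose speech segments overlap its sightings most often. A sketch, with pyannote's diarization output simplified into plain `(speaker, start, end)` tuples and made-up face IDs:

```python
# Sketch: link diarized speakers to face tracks by temporal
# co-occurrence. speech_segments stand in for pyannote output;
# face_sightings are (face_id, timestamp) pairs from the camera.

from collections import Counter, defaultdict

def link_faces_to_speakers(speech_segments, face_sightings):
    """speech_segments: list of (speaker, start, end) tuples.
    face_sightings: list of (face_id, timestamp) pairs.
    Returns {face_id: speaker seen most often while speaking}."""
    votes = defaultdict(Counter)
    for face_id, t in face_sightings:
        for speaker, start, end in speech_segments:
            if start <= t <= end:
                votes[face_id][speaker] += 1
    return {face: counts.most_common(1)[0][0]
            for face, counts in votes.items()}

speech = [("SPEAKER_00", 0.0, 5.0), ("SPEAKER_01", 6.0, 10.0)]
faces = [("face_A", 1.0), ("face_A", 3.0), ("face_B", 7.0)]
print(link_faces_to_speakers(speech, faces))
# → {'face_A': 'SPEAKER_00', 'face_B': 'SPEAKER_01'}
```

Caveat: the wearer's own voice will have no matching face, and the visible face isn't always the speaker, so this yields noisy labels that the overnight training would have to tolerate.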
If you have any comments or suggestions, I'd be glad to hear them.