Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model always giving the same description #24

Open
aasishkc4 opened this issue Feb 20, 2025 · 5 comments
Open

Model always giving the same description #24

aasishkc4 opened this issue Feb 20, 2025 · 5 comments

Comments

@aasishkc4
Copy link

I was trying the new omni-research/Tarsier2-Recap-7b model. But even after making changes to the input prompt the model always outputs the same video description. Is there some parameter that needs to be changed? Currently, the parameters are as default provided within the GitHub tarsier 2 branch.

@jwwang424
Copy link
Collaborator

Tarsier2-Recap-7b Is only trained with the Tarsier-Recap-585K,which is distilled from Tarsier-7B with the only one prompt of "Describe the video in detail". That's why it outputs the same description despite the user prompt.

@PeterWangyi
Copy link

@jwwang424
Hello, tarsier-recap-7b is a great model. But I found two problems when using it:

  1. The model will describe the subtitles of the video
  2. In the case of multiple people, it will only describe two of them

It was mentioned above that only a single prompt was used during training. Did you use the system prompt during training? Or do you have any suggestions for solutions to the above problems?

@jwwang424
Copy link
Collaborator

jwwang424 commented Feb 24, 2025

@jwwang424 Hello, tarsier-recap-7b is a great model. But I found two problems when using it:

  1. The model will describe the subtitles of the video
  2. In the case of multiple people, it will only describe two of them

It was mentioned above that only a single prompt was used during training. Did you use the system prompt during training? Or do you have any suggestions for solutions to the above problems?

No system prompt.
For 1. We didn't made any effort in forcing the model to ignore the subtitles, either in our data construction or training procedure. You need to conduct extra post-training (sft or rl) to highlight this requirement.
For 2. The model was not induced to describe only two subjects in the training. It was not supposed to have this preference. How often do you notice such phenomenon? As in the case of "assets/videos/coffee.gif", tarsier-recap-7b described all three people in the clip.

@aasishkc4
Copy link
Author

I have found a solution around which helps. The CLI version which has the Chat option which outputs the summary first then we can query new prompt again. This will send the previous context with the new prompt. By this way the model is able to understand and react to the new prompt.

@aasishkc4
Copy link
Author

Hey @jwwang424
I need a small help, I have two video.

  1. A person opening a door
  2. A person rotating a big circular valve.

In 1. I need to know if the person is pulling or pushing the door after opening. Similarly, in 2. I want to know the rotation direction.
The model always says pull for 1 and clockwise for rotation. But in Tarsier1 I had prompt which was somehow giving me the right thing.

Tarsier1 prompt

  1. After the hand grips the door handle, does it move outward (push) or inward (pull) to open the door?
  2. Analyze the sequence of frames to determine the rotation direction of the valve handle. Identify whether it moves clockwise or counterclockwise by tracking its position relative to a fixed reference point on the valve body.

Let me know if you have any idea about this.
Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants