Model always giving the same description #24

aasishkc4 · 2025-02-20T15:30:11Z

I was trying the new omni-research/Tarsier2-Recap-7b model. But even after making changes to the input prompt the model always outputs the same video description. Is there some parameter that needs to be changed? Currently, the parameters are as default provided within the GitHub tarsier 2 branch.

jwwang424 · 2025-02-21T02:15:07Z

Tarsier2-Recap-7b Is only trained with the Tarsier-Recap-585K，which is distilled from Tarsier-7B with the only one prompt of "Describe the video in detail". That's why it outputs the same description despite the user prompt.

PeterWangyi · 2025-02-24T03:38:25Z

@jwwang424
Hello, tarsier-recap-7b is a great model. But I found two problems when using it:

The model will describe the subtitles of the video
In the case of multiple people, it will only describe two of them

It was mentioned above that only a single prompt was used during training. Did you use the system prompt during training? Or do you have any suggestions for solutions to the above problems?

jwwang424 · 2025-02-24T04:58:39Z

@jwwang424 Hello, tarsier-recap-7b is a great model. But I found two problems when using it:

The model will describe the subtitles of the video

In the case of multiple people, it will only describe two of them

It was mentioned above that only a single prompt was used during training. Did you use the system prompt during training? Or do you have any suggestions for solutions to the above problems?

No system prompt.
For 1. We didn't made any effort in forcing the model to ignore the subtitles, either in our data construction or training procedure. You need to conduct extra post-training (sft or rl) to highlight this requirement.
For 2. The model was not induced to describe only two subjects in the training. It was not supposed to have this preference. How often do you notice such phenomenon? As in the case of "assets/videos/coffee.gif", tarsier-recap-7b described all three people in the clip.

aasishkc4 · 2025-02-24T06:11:20Z

I have found a solution around which helps. The CLI version which has the Chat option which outputs the summary first then we can query new prompt again. This will send the previous context with the new prompt. By this way the model is able to understand and react to the new prompt.

aasishkc4 · 2025-02-26T03:52:29Z

Hey @jwwang424
I need a small help, I have two video.

A person opening a door
A person rotating a big circular valve.

In 1. I need to know if the person is pulling or pushing the door after opening. Similarly, in 2. I want to know the rotation direction.
The model always says pull for 1 and clockwise for rotation. But in Tarsier1 I had prompt which was somehow giving me the right thing.

Tarsier1 prompt

After the hand grips the door handle, does it move outward (push) or inward (pull) to open the door?
Analyze the sequence of frames to determine the rotation direction of the valve handle. Identify whether it moves clockwise or counterclockwise by tracking its position relative to a fixed reference point on the valve body.

Let me know if you have any idea about this.
Thank you.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model always giving the same description #24

Model always giving the same description #24

aasishkc4 commented Feb 20, 2025

jwwang424 commented Feb 21, 2025

PeterWangyi commented Feb 24, 2025

jwwang424 commented Feb 24, 2025 •

edited

Loading

aasishkc4 commented Feb 24, 2025

aasishkc4 commented Feb 26, 2025

Model always giving the same description #24

Model always giving the same description #24

Comments

aasishkc4 commented Feb 20, 2025

jwwang424 commented Feb 21, 2025

PeterWangyi commented Feb 24, 2025

jwwang424 commented Feb 24, 2025 • edited Loading

aasishkc4 commented Feb 24, 2025

aasishkc4 commented Feb 26, 2025

Tarsier1 prompt

jwwang424 commented Feb 24, 2025 •

edited

Loading