Download link: tvqa_qa_release.tar.gz [15MB]
md5sum: 7f751d611848d0756ee4b760446ef7cf
tvqa_qa_release.tar.gz contains the following files:
File | #QAs | Usage |
---|---|---|
tvqa_train.jsonl | 122,039 | Model training |
tvqa_val.jsonl | 15,253 | Hyperparameter tuning |
tvqa_test_public.jsonl | 7,623 | Model testing. Labels are not released for this set; please upload your predictions to the server for testing. |
Note that the original test set described in the TVQA paper is split into two subsets, test-public (7,623 QAs) and test-challenge (7,630 QAs). The test-public set is used for paper publication and is shown on the leaderboard; the test-challenge set is reserved for future use.
Each line of these files can be loaded as a JSON object, containing the following entries:
Key | Type | Description |
---|---|---|
qid | int | question id |
q | str | question |
a0, ..., a4 | str | multiple choice answers |
answer_idx | int | answer index; this entry does not exist for the test_public set |
ts | str | timestamp annotation, e.g. '76.01-84.2' denotes that the localized span starts at 76.01 seconds and ends at 84.2 seconds |
vid_name | str | name of the video clip that accompanies the question. The videos are named following the format '{show_abbreviation}_s{season}e{episode}_seg{segment}_clip_{clip}', e.g. 'met_s06e22_seg01_clip_02' |
show_name | str | name of the TV show |
A sample QA entry is shown below:

    {
        "a0": "A martini glass",
        "a1": "Nachos",
        "a2": "Her purse",
        "a3": "Marshall's book",
        "a4": "A beer bottle",
        "answer_idx": 4,
        "q": "What is Robin holding in her hand when she is talking to Ted about Zoey?",
        "qid": 7,
        "ts": "1.21-8.49",
        "vid_name": "met_s06e22_seg01_clip_02",
        "show_name": "How I Met Your Mother"
    }
Download link: tvqa_subtitles.tar.gz [14MB]
md5sum: 70094363db36357f4ad4f52ae68a0af8
tvqa_subtitles.tar.gz contains the subtitle files (SRT format) for the video clips; speaker names are prepended to the subtitle text. A sample is shown below:
    ...
    9
    00:00:19,275 --> 00:00:20,775
    (Ted:)Dude, that's my girlfriend.

    10
    00:00:20,776 --> 00:00:25,146
    (Barney:)Point is, we are taking her and The Arcadian down.

    11
    00:00:25,147 --> 00:00:26,548
    (Barney:)Am I right, Teddy Westside?

    12
    00:00:26,549 --> 00:00:28,300
    (Ted:)You know it. Ha-ha!

    13
    00:00:28,301 --> 00:00:30,118
    (Lily:)Okay. See, that's so weird to me.
    ...
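A minimal sketch of parsing these subtitle files in Python is shown below; the '(Speaker:)' prefix handling is inferred from the sample above, and the file name in the commented usage is hypothetical.

```python
import re
from pathlib import Path

# One SRT cue: an index line, a "start --> end" timestamp line, then the text.
CUE_RE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\s*\n|\Z)",
    re.S,
)
# Speaker prefix as it appears in the sample, e.g. "(Ted:)Dude, that's my girlfriend."
SPEAKER_RE = re.compile(r"^\((.+?):\)\s*(.*)$")

def parse_srt(path):
    """Parse an SRT file into a list of cues with speaker names split out."""
    cues = []
    for idx, start, end, text in CUE_RE.findall(Path(path).read_text(encoding="utf-8")):
        line = " ".join(text.splitlines()).strip()
        m = SPEAKER_RE.match(line)
        speaker, utterance = (m.group(1), m.group(2)) if m else (None, line)
        cues.append({"idx": int(idx), "start": start, "end": end,
                     "speaker": speaker, "text": utterance})
    return cues

# Hypothetical usage; the .srt file name is assumed to mirror the vid_name convention.
# for cue in parse_srt("met_s06e22_seg01_clip_02.srt"):
#     print(cue["speaker"], cue["text"])
```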
3.1 ImageNet feature, download link: tvqa_imagenet_pool5_hq.h5
The download contains an HDF5 file. The 2048D features are extracted from the pool5 layer of an ImageNet-pretrained ResNet-101 model.
For each clip, we use at most 300 frames. If a clip has more than 300 frames, it is downsampled to 300 (see the sketch below).
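Below is a minimal sketch of reading the feature file with h5py and applying the 300-frame cap. The file name, the one-dataset-per-vid_name key layout, and the uniform-spacing downsampling are assumptions, not guarantees about the released file.

```python
import h5py
import numpy as np

MAX_FRAMES = 300  # per-clip frame cap described above

def downsample_indices(num_frames, max_frames=MAX_FRAMES):
    """Pick at most max_frames frame indices; uniform spacing is an assumption."""
    if num_frames <= max_frames:
        return np.arange(num_frames)
    return np.linspace(0, num_frames - 1, max_frames).astype(int)

# File name and key layout (one (num_frames, 2048) array per vid_name) are assumptions.
with h5py.File("tvqa_imagenet_pool5_hq.h5", "r") as f:
    feats = f["met_s06e22_seg01_clip_02"][:]           # (num_frames, 2048)
    feats = feats[downsample_indices(feats.shape[0])]  # keep at most 300 frames
    print(feats.shape)
```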
To download the files stored on Google Drive, we recommend using a command line tool such as gdrive.
3.2 Visual concepts feature, download link: det_visual_concepts_hq.pickle
The downloaded file contains a Python dict with 'vid_name' as keys. Each value is a list of sentences, one per frame, where each sentence contains the object and attribute labels detected in that frame by a modified Faster R-CNN trained on Visual Genome. Note this feature is downsampled in the same way as the ImageNet feature.
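A minimal sketch of inspecting this dict in Python; the pickle file name is an assumption based on the link above.

```python
import pickle

# The file name is an assumption; the download is described above as a Python dict
# mapping vid_name -> a list of per-frame "sentences" of detected object/attribute labels.
with open("det_visual_concepts_hq.pickle", "rb") as f:
    concepts = pickle.load(f)

frames = concepts["met_s06e22_seg01_clip_02"]  # vid_name key from the QA sample above
print(len(frames))   # number of (downsampled) frames for this clip
print(frames[0])     # detected object/attribute labels for the first frame
```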
3.3 Regional visual feature: please follow the instructions here to do the extraction yourself. We currently do not plan to release this feature due to its size.
Download link: tvqa_video_frames_fps3_hq.tar.gz [43GB]. Please fill out this form first.
The video frames are extracted at 3 frames per second (FPS); a sample is shown below. To obtain the frames, please fill out the form first. You will be asked to provide information about yourself and your advisor, and to sign our agreement. If your form is valid, the download link for the video frames will be sent to you within about a week. Please do not share the video frames with others.
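Because the frames are extracted at 3 FPS, a question's 'ts' annotation can be mapped to approximate frame indices. The sketch below assumes 0-based frame indices and simple floor rounding, which may differ from the naming convention of the released frames.

```python
def ts_to_frame_indices(ts, fps=3):
    """Map a 'start-end' timestamp string (in seconds) to frame indices at the given FPS.

    Assumes 0-based frame indices and floor rounding; the actual frame-naming
    convention of the released frames may differ.
    """
    start, end = (float(x) for x in ts.split("-"))
    return list(range(int(start * fps), int(end * fps) + 1))

# Using the ts value from the sample QA above:
print(ts_to_frame_indices("1.21-8.49"))  # -> frame indices 3 through 25
```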