Task Description

This year we will focus on visual-language pre-training for the downstream task of video captioning. Given the automatically collected GIF videos and the corresponding captions, the goal of visual-language pre-training is to learn a generic representation or structure that better reflects the cross-modal interaction between visual content and textual sentences. The learnt generic representation or structure is further adapted to facilitate the downstream task of video captioning, i.e., describing video content with a complete and natural sentence.

The contestants are asked to develop a video captioning system based on the Auto-captions on GIF dataset provided by the Challenge (as pre-training data) and the public MSR-VTT benchmark (as training data for the downstream task). For evaluation purposes, a contesting system is asked to produce at least one sentence for each test video. Accuracy will be evaluated against human pre-generated sentence(s) during the evaluation stage.

Relevance to Previous Challenges

Most of the organizers of this proposal successfully co-organized the MSR Video to Language Challenge at ACM MM 2016 and ACM MM 2017, as listed below. Previous video captioning challenges predominantly focused on training video captioning systems with manually annotated video-sentence pairs. This challenge goes a step beyond traditional video captioning and targets pre-training a generic representation or structure that facilitates the downstream video captioning task. Since no challenge at the major multimedia conferences is dedicated to pre-training for video captioning, our challenge will offer a valuable venue to foster research into visual-language pre-training.

MSR Video to Language Challenge 2016, ACM Multimedia 2016 Grand Challenge (30 teams)
MSR Video to Language Challenge 2017, ACM Multimedia 2017 Grand Challenge (15 teams)

Submission File

To enter the competition, you need to create an account on the Evaluation Server. This account allows you to upload your results to the server. Each run must be formatted as a JSON file as follows:

  "version": "VERSION 1.3",
    "video_id": "test_video_2020_218",
    "caption": "a monkey fell out of his bed and hit his head"
    "video_id": "test_video_2020_682",
    "caption": "a person is riding a small motor bike"
    "used": "true", # Boolean flag. True indicates use of pre-training data (i.e., Auto-captions on GIF).
    "details": "First pre-train captioning model with Auto-captions on GIF and then fine-tune it with MSR-VTT" # String with details of how to train your models with pre-training data, e.g., first pre-train captioning model with Auto-captions on GIF and then fine-tune it with MSR-VTT, or train captioning model over the joint combination of Auto-captions on GIF and MSR-VTT.


Note: the comments (the text after each #) are illustrative and only provide inline explanations. Please omit them from your submissions.

To help you better understand the format of the submission file, a sample submission can be seen here. Participants must strictly follow the submission format.
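As a rough illustration of the format described above, the following Python sketch builds one run and serializes it as JSON without inline comments. The field names "version", "video_id", "caption", "used" and "details" come from the listing above. The nesting is not recoverable from this page, so the wrapper keys "result" and "external_data", as well as the helper names build_run and missing_videos, are assumptions made for illustration only. The official sample submission remains the authoritative reference.

```python
import json


def build_run(captions, used_pretraining, details, version="VERSION 1.3"):
    """Build one run as a JSON-serializable dict.

    captions maps each video_id to a single caption string.
    NOTE: the wrapper keys "result" and "external_data" are assumptions;
    check them against the official sample submission.
    """
    for video_id, caption in captions.items():
        if not caption or not caption.strip():
            raise ValueError(f"empty caption for {video_id}")
    return {
        "version": version,
        "result": [  # assumed key name
            {"video_id": video_id, "caption": caption.strip()}
            for video_id, caption in sorted(captions.items())
        ],
        "external_data": {  # assumed key name
            # The listing stores the flag as the string "true", so do the same.
            "used": "true" if used_pretraining else "false",
            "details": details,
        },
    }


def missing_videos(run, test_video_ids):
    """Return the test video ids that have no caption in the run."""
    covered = {item["video_id"] for item in run["result"]}
    return sorted(set(test_video_ids) - covered)


run = build_run(
    {
        "test_video_2020_218": "a monkey fell out of his bed and hit his head",
        "test_video_2020_682": "a person is riding a small motor bike",
    },
    used_pretraining=True,
    details="First pre-train captioning model with Auto-captions on GIF "
            "and then fine-tune it with MSR-VTT",
)

# json.dumps never emits comments, so the output is free of "#" annotations.
text = json.dumps(run, indent=2)
print(text)
```

Calling missing_videos(run, ids) with the full list of test video ids is one way to check that every test video received at least one sentence before uploading.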

All the results should be zipped into a single file named result.zip. Within the zipped folder, results from different runs (up to three) should be placed in separate files (e.g., result1.json, result2.json, result3.json). Every team is also required to upload a one-page notebook paper that briefly describes its system. The paper format follows the ACM proceedings style.
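The packaging step can be sketched in Python as follows. The file names result.zip and result1.json to result3.json and the limit of three runs come from the paragraph above. The function name package_runs, the placeholder run payloads, and the choice to place the JSON files at the root of the archive are assumptions, not requirements stated by the Challenge.

```python
import json
import os
import tempfile
import zipfile

MAX_RUNS = 3  # the Challenge accepts up to three runs per submission


def package_runs(runs, archive_path="result.zip"):
    """Write each run to resultN.json inside a single zip archive.

    runs is a list of JSON-serializable objects, one per run.
    Returns the list of file names stored in the archive.
    """
    if not 1 <= len(runs) <= MAX_RUNS:
        raise ValueError(f"expected 1 to {MAX_RUNS} runs, got {len(runs)}")
    names = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for index, run in enumerate(runs, start=1):
            name = f"result{index}.json"
            # Assumption: files sit at the archive root, not in a subfolder.
            archive.writestr(name, json.dumps(run, indent=2))
            names.append(name)
    return names


# Placeholder payloads; real runs must follow the official submission format.
run_a = {"version": "VERSION 1.3"}
run_b = {"version": "VERSION 1.3"}

# A temporary directory keeps this demonstration from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.zip")
    names = package_runs([run_a, run_b], path)
    with zipfile.ZipFile(path) as archive:
        listed = archive.namelist()

print(listed)
```

For an actual submission, calling package_runs(runs) with the default path writes result.zip to the current directory.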


The Challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams.

At the end of the Challenge, all teams will be ranked based on the objective evaluation described above. The top three teams will receive award certificates. In addition, all accepted submissions qualify for the ACM MM 2020 Challenge award competition.