Task Description

This year we will focus on two tasks: pre-training for the video captioning downstream task, and pre-training for the video categorization downstream task.

In the first track, given the GIF videos and the corresponding captions in ACTION, the goal of pre-training is to learn a generic representation or structure that better reflects the cross-modal interaction between visual content and textual sentences. The learnt generic representation or structure is then adapted to facilitate the downstream task of video captioning, i.e., describing video content with a complete and natural sentence.


The contestants are asked to develop a video captioning system based on the ACTION dataset provided by the Challenge (as pre-training data) and the public MSR-VTT benchmark (as training data for the downstream task). For evaluation purposes, a contesting system is asked to produce at least one sentence for each test video. The accuracy will be evaluated against human-generated reference sentence(s).

In the second track, given the YouTube videos and the corresponding search queries & titles in a Weakly-Supervised dataset, the goal is to pre-train a generic video representation, which can then be leveraged to facilitate the downstream task of video categorization.


The contestants are asked to develop a video categorization system based on the Weakly-Supervised dataset provided by the Challenge (as pre-training data) and the released Downstream dataset (as training data for the downstream task). For evaluation purposes, a contesting system is asked to predict the category of each test video. The accuracy will be evaluated against human-annotated categories during the evaluation stage.

Relevance to Previous Challenges

Most of the organizers of this proposal have successfully co-organized the MSR Video to Language Challenge at ACM MM 2016 and ACM MM 2017, and the Pre-training for Video Captioning Challenge at ACM MM 2020, as listed below. Previous challenges predominantly focused on training video captioning systems with manually annotated video-sentence pairs or automatically collected pre-training data. This challenge goes a step beyond the video captioning task and targets pre-training a generic video representation or structure that facilitates a series of video understanding downstream tasks. Since there is no challenge dedicated to pre-training for video understanding (i.e., video captioning and video categorization) in major multimedia conferences, our challenge will offer a valuable venue to foster research on pre-training for video understanding.

MSR Video to Language Challenge 2016, ACM Multimedia 2016 Grand Challenge (30 teams)
MSR Video to Language Challenge 2017, ACM Multimedia 2017 Grand Challenge (15 teams)
Pre-training for Video Captioning Challenge 2020, ACM Multimedia 2020 Grand Challenge (50 teams)



Submission File

To enter the competition, you need to create an account on the Evaluation Server. This account allows you to upload your results to the server. Each run must be formatted as a JSON file, as shown below.

Pre-training for Video Captioning Track

{
  "version": "VERSION 1.3",
  "result":[
  {
    "video_id": "test_video_2020_218",
    "caption": "a monkey fell out of his bed and hit his head"
  },
  ...
  {
    "video_id": "test_video_2020_682",
    "caption": "a person is riding a small motor bike"
  }
  ],
  "external_data":{
    "used": "true", # Boolean flag. True indicates used of pre-trained data.
    "details": "First pre-train captioning model with ACTION and then fine-tune it with MSR-VTT" # String with details of how to train your models with pre-training data, e.g., first pre-train captioning model with ACTION and then fine-tune it with MSR-VTT, or train captioning model over the joint combination of ACTION on GIF and MSR-VTT.
  }
}
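For illustration, below is a minimal Python sketch that serializes captioning results into this format; the predictions dict and the output file name result1.json are hypothetical stand-ins for your own model outputs.

import json

# Hypothetical model outputs: video_id -> generated caption.
predictions = {
    "test_video_2020_218": "a monkey fell out of his bed and hit his head",
    "test_video_2020_682": "a person is riding a small motor bike",
}

submission = {
    "version": "VERSION 1.3",
    "result": [
        {"video_id": vid, "caption": cap} for vid, cap in predictions.items()
    ],
    "external_data": {
        "used": "true",
        "details": "First pre-train captioning model with ACTION and then fine-tune it with MSR-VTT",
    },
}

# Write plain JSON -- no inline comments, as required by the submission format.
with open("result1.json", "w") as f:
    json.dump(submission, f, indent=2)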

              

Pre-training for Video Categorization Track

{
  "version": "VERSION 1.3",
  "result":[
   {
    "video_id": "video2188",
    "category_id": 1
    },
    ...
    {
    "video_id": "video6822",
    "category_id": 2
    }
    ],
    "external_data":{
    "used": "true", # Boolean flag. True indicates used of pre-trained data.
    "details": "First pre-train captioning model with the Weakly-Supervised dataset and then fine-tune it with Downstream" # String with details of how to train your models with pre-training data
    }
}
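Before uploading, a lightweight sanity check along the following lines can catch formatting mistakes. This is only a sketch, not an official validator; the field checks are inferred from the two examples above.

import json

def check_run(path, caption_track=False):
    """Lightweight sanity check of a run file against the format above."""
    with open(path) as f:
        sub = json.load(f)  # fails here if inline comments were left in the file
    assert sub.get("version") == "VERSION 1.3"
    payload_key = "caption" if caption_track else "category_id"
    for entry in sub["result"]:
        assert "video_id" in entry and payload_key in entry
    assert {"used", "details"} <= set(sub["external_data"].keys())
    print(f"{path}: {len(sub['result'])} predictions look well-formed")

# Set caption_track=True for a captioning run, False for a categorization run.
check_run("result1.json", caption_track=True)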
        

Note: the inline comments (marked with #) are illustrative and provide detailed explanations of the fields. Please do not include them in your submissions.

To help with better understanding the format of the submission file, sample submissions are available for track_1 and track_2. Participants should strictly follow the submission format.


All results should be zipped into a single file named result.zip. Within the zipped archive, results from different runs (up to three) should be placed in separate files (e.g., result1.json, result2.json, result3.json). Every team is also required to upload a one-page notebook paper that briefly describes its system. The paper should follow the ACM proceedings style.
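As a convenience, here is a minimal Python sketch for packaging the run files into the required archive; the file names follow the example above and should be replaced with your own runs.

import zipfile

# Package up to three run files into a single result.zip.
with zipfile.ZipFile("result.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in ["result1.json", "result2.json", "result3.json"]:
        zf.write(name)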



Participation

The Challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams.

At the end of the Challenge, all teams will be ranked based on the objective evaluation described above. The top three teams will receive award certificates. In addition, all accepted submissions qualify for the ACM MM 2021 Challenge award competition.