Use the Knowledge Base

Minecraft has more than 100M active players, who have collectively generated an enormous wealth of data. MineDojo features a massive database collected automatically from the internet. AI agents can learn from this treasure trove of knowledge to harvest actionable insights, acquire diverse skills, develop complex strategies, and discover interesting objectives to pursue. All our databases are open-access and available to download today!

YouTube Database

Minecraft is among the most streamed games on YouTube. Human players have demonstrated a stunning range of creative activities and sophisticated missions that take hours to complete. We collect 730K+ narrated Minecraft videos, which add up to ~300K hours and 2.2B words in English transcripts. The time-aligned transcripts enable the agent to ground free-form natural language in video pixels and learn the semantics of diverse activities without laborious human labeling.

We provide two sets of YouTube data.

  • Tutorial videos: Minecraft tutorial videos include step-by-step demonstrations and sometimes detailed verbal explanations. They also serve as a rich source of creative missions that humans find interesting. We harvest thousands of tasks from these videos in our benchmarking suite.

  • General gameplay videos: Unlike tutorials, general gameplay videos do not necessarily provide guidance on particular tasks. Instead, they capture “in-the-wild” human experiences that are much larger in quantity, diverse in content, and rich in learning signals.

To load YouTube data:

>>> from minedojo.data import YouTubeDataset
>>> youtube_dataset = YouTubeDataset(
        full=True,     # full=False for tutorial videos or 
                       # full=True for general gameplay videos
        download=True, # download=True to automatically download data or 
                       # download=False to load data from download_dir
        download_dir={your_download_directory} 
                       # default: "~/.minedojo". You can also manually download data from
                       # https://doi.org/10.5281/zenodo.6641142 and put it in download_dir.           
    ) 
>>> print(len(youtube_dataset))
736940
>>> youtube_item = youtube_dataset[0]
>>> print(youtube_item.keys())
dict_keys(['id', 'title', 'link', 'view_count', 'like_count', 'duration', 'fps'])

Each item in the YouTube dataset is a dictionary of the video’s metadata with the following fields:

{
    "id": str,         # video id
    "title": str,      # video title
    "link": str,       # video link
    "view_count": int  # number of times the video has been viewed
    "like_count": int  # number of users who have indicated that they liked the video
    "duration": float  # video duration in seconds
    "fps": float,      # video FPS
}
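For instance, the following sketch uses only the indexing and metadata fields shown above to compute the total amount of footage and find the single longest video (it simply loops over the dataset, so expect it to take a while over the full 730K+ items):

total_seconds = 0.0
longest = None
for i in range(len(youtube_dataset)):
    item = youtube_dataset[i]           # a metadata dict with the fields listed above
    total_seconds += item["duration"]
    if longest is None or item["duration"] > longest["duration"]:
        longest = item

print(f"total footage: ~{total_seconds / 3600:.0f} hours")
print(f"longest video: {longest['title']} ({longest['link']})")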

If you want to download the YouTube MP4 files and transcripts directly, there are open-source tools that make this easy.
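For example, here is a minimal sketch based on the open-source yt-dlp package (installed separately with pip install yt-dlp; it is not part of MineDojo) that fetches one video’s MP4 and its English captions from the link stored in the metadata. Treat the options as a starting point rather than a prescribed configuration:

import yt_dlp  # third-party downloader, installed separately from MineDojo

ydl_opts = {
    "format": "mp4",              # download an MP4 stream
    "writesubtitles": True,       # save uploaded subtitles when available
    "writeautomaticsub": True,    # otherwise fall back to auto-generated captions
    "subtitleslangs": ["en"],     # English transcripts only
    "outtmpl": "%(id)s.%(ext)s",  # name output files after the video id
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([youtube_item["link"]])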

To display a YouTube video in Jupyter notebook:

from IPython.display import YouTubeVideo
YouTubeVideo(youtube_item["id"])

Wiki Database

The Wiki pages cover almost every aspect of the game mechanics, and supply a rich source of unstructured knowledge in multimodal tables, recipes, illustrations, and step-by-step tutorials. We scrape ~7K pages that interleave text, images, tables, and diagrams. To preserve the layout information, we also save the screenshots of entire pages and extract bounding boxes of the visual elements.

To load Wiki data:

>>> from minedojo.data import WikiDataset
>>> wiki_dataset = WikiDataset(
        full=False,    # full=False for a sample subset of Wiki data (quick to download) or 
                       # full=True for the full Wiki dataset
        download=True, # download=True to automatically download data or 
                       # download=False to load data from download_dir
        download_dir={your_download_directory} 
                       # default: "~/.minedojo". You can also manually download data from 
                       # https://doi.org/10.5281/zenodo.6640448 and extract it to download_dir.
    )
>>> print(len(wiki_dataset))
10
>>> wiki_item = wiki_dataset[0]
>>> print(wiki_item.keys())
dict_keys(['metadata', 'tables', 'images', 'sprites', 'texts', 'screenshot'])

Each item in the Wiki dataset is a dictionary with the following fields:

{
    "metadata": {
        "url": str,
        "title": str,
        "time": str
    },
    "screenshot": PIL.Image.Image,
    "texts": list[
        {
            "text": str,
            "bbox": [x: int, y: int, h: int, w: int]
        }
    ],
    "images": list[
        {
            "src": str,
            "path": str,
            "alt_text": str,
            "caption": str,
            "bbox": [x: int, y: int, h: int, w: int],
            "rgb": PIL.Image.Image
        }
    ],
    "sprites": list[
        {
            "text": str,
            "bbox": [x: int, y: int, h: int, w: int]
        }
    ],
    "tables": list[
        {
            "bbox": list[x: int, y: int, h: int, w: int],
            "headers": 
            {
                "text": list[str],
                "bboxes": list[list[x: int, y: int, h: int, w: int]]
            },
            "cells": 
            {
                "text": list[str],
                "bboxes": list[list[x: int, y: int, h: int, w: int]]
            }
        },
    ]
}
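For instance, a quick sketch that relies only on the fields documented above to inspect the first table of a Wiki item:

if wiki_item["tables"]:
    table = wiki_item["tables"][0]
    # headers and cells hold parallel lists of strings and bounding boxes
    print("table bbox:", table["bbox"])
    print("header text:", table["headers"]["text"])
    if table["cells"]["text"]:
        print("first cell:", table["cells"]["text"][0], "at bbox", table["cells"]["bboxes"][0])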

To display an image in Jupyter notebook:

wiki_item["images"][0]["rgb"]

The screenshot field is a full-page screenshot image of the Minecraft Wiki page. Each bbox value stores the location of the top-left corner of the HTML element’s bounding box (x, y), followed by its height and width (h, w).
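As a quick sanity check of this layout, the sketch below crops the first text element out of the screenshot; note that PIL’s crop expects (left, upper, right, lower) coordinates:

# bbox is stored as [x, y, h, w]: top-left corner, then height and width
x, y, h, w = wiki_item["texts"][0]["bbox"]
wiki_item["screenshot"].crop((x, y, x + w, y + h))  # displays the cropped region in Jupyter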

Here we offer a simple example showing how to plot the bboxes on top of the screenshot:

import PIL
from PIL import Image, ImageDraw

def get_corners(bbox, y_shift: int = 0):
    """Convert a bbox into [x0, y0, x1, y1] corner coordinates for PIL."""
    # bbox is stored as [x, y, h, w] per the schema above
    x, y, h, w = bbox
    x0 = x
    y0 = y - y_shift
    x1 = x0 + w
    y1 = y0 + h
    return [x0, y0, x1, y1]

def draw_bbox(data, color="red", width=2) -> PIL.Image.Image:
    """Draw bboxes on the wiki page screenshot given a data item from the Wiki dataset."""
    # Draw on a copy so the original screenshot stored in the dataset item stays untouched.
    screenshot = data["screenshot"].copy()
    draw = ImageDraw.Draw(screenshot)

    for text in data["texts"]:
        draw.rectangle(get_corners(text["bbox"]), outline=color, width=width)

    for image in data["images"]:
        draw.rectangle(get_corners(image["bbox"]), outline=color, width=width)

    for table in data["tables"]:
        draw.rectangle(get_corners(table["bbox"]), outline=color, width=width)

    for sprite in data["sprites"]:
        draw.rectangle(get_corners(sprite["bbox"]), outline=color, width=width)

    return screenshot

Below is the screenshot of this Wiki page, with the bboxes drawn as red rectangles:

draw_bbox(wiki_item)

Reddit Database

We collect 340K+ Reddit posts along with 6.6M comments under the “r/Minecraft” subreddit. These posts ask questions on how to solve certain tasks, showcase cool architectures and achievements in image/video snippets, and discuss general tips and tricks for players of all expertise levels. Large language models can be finetuned on our Reddit corpus to internalize Minecraft-specific concepts and develop sophisticated strategies.

We use PRAW, a Python wrapper on top of the official Reddit API, to scrape the Reddit content. To load the Reddit database, you first need to obtain your client_id, client_secret, and user_agent; see the PRAW documentation for how to set these up.

>>> from minedojo.data import RedditDataset
>>> reddit_dataset = RedditDataset(
        client_id={your_client_id}, 
        client_secret={your_client_secret}, 
        user_agent={your_user_agent},
        download=True, # download=True to automatically download data or
                       # download=False to load data from download_dir
        download_dir={your_download_directory} 
                       # default: "~/.minedojo". You can also manually download data from 
                       # https://doi.org/10.5281/zenodo.6641114 and put it in download_dir.
    ) 
>>> print(len(reddit_dataset))
345876
>>> reddit_item = reddit_dataset[0]
>>> print(reddit_item.keys())
dict_keys(['id', 'title', 'link', 'score', 'num_comments', 'created_utc', 'type', 'content', 'comments'])

Each item in the Reddit dataset is a dictionary with the following fields:

{
    "id": str,           # the unique, base36 id of the Reddit post.
    "title": str,        # the title of the Reddit post.
    "link": str,         # The url of the Reddit post.
    "score": int,        # the score of the Reddit post.
    "num_comments": int, # the number of comments under the Reddit post.
    "created_utc": int,  # the date and time the Reddit post was created, in UTC format.
    "type": str,         # the type of the post - either "image", "text", "video", "link", 
                         # "media" (gif or gif + image), "text_with_media", or "unknown."
    "content": str,      # if text type post, text in post body. Otherwise, the media 
                         # source url or unknown link.
    "comments": 
    [
        {
            "id": str,        # the unique base36 comment id
            "parent_id": str, # id of the comment's parent in the nested comment tree
            "content": str,   # text in comment body
        }
    ]
}

The type field can take several values, as listed above. The following shows an example of four common types: image, text, video, and link.

  • Image type:

>>> reddit_item_image = reddit_dataset[3]
>>> print("Reddit post type:", reddit_item_image["type"])
Reddit post type: image

To display an image in Jupyter notebook:

from IPython.display import Image
Image(reddit_item_image["content"])

  • Text type:

>>> reddit_item_text = reddit_dataset[627]
>>> print("Reddit post type:", reddit_item_text["type"])
Reddit post type: text

  • Video type:

>>> reddit_item_video = reddit_dataset[0]
>>> print("Reddit post type:", reddit_item_video["type"])
Reddit post type: video
>>> print(reddit_item_video['content'])
https://v.redd.it/ywph5bcan9w31

  • Link type:

>>> reddit_item_link = reddit_dataset[1335]
>>> print("Reddit post type:", reddit_item_link["type"])
Reddit post type: link
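
Finally, as a small illustration of the language-model finetuning use case mentioned at the beginning of this subsection, the sketch below uses only the fields documented above to flatten a post and its comments into a single plain-text training document; the exact formatting is just one possible choice:

def post_to_document(post) -> str:
    """Concatenate a post's title, body (for text posts), and comments into one string."""
    lines = [post["title"]]
    if post["type"] == "text":
        lines.append(post["content"])
    lines.extend(comment["content"] for comment in post["comments"])
    return "\n".join(lines)

print(post_to_document(reddit_item_text)[:500])  # preview the first 500 characters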