minedojo.data package
minedojo.data module
- class minedojo.data.youtube_dataset.YouTubeDataset(*, download=True, download_dir=None, full=True)[source]
Bases: object
Class for MineDojo YouTube Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general.
- Parameters
  - download (bool) – If set to True and there is no existing cache directory, the data will be downloaded automatically.
  - download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.
  - full (bool) – If True, the full version of the YouTube database will be downloaded. If False, only the tutorial version of the YouTube database will be downloaded. Default: True.
Examples
>>> from minedojo.data import YouTubeDataset
>>> youtube_dataset = YouTubeDataset()
>>> print(youtube_dataset[0].keys())
dict_keys(['id', 'title', 'link', 'view_count', 'like_count', 'duration', 'fps'])
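Beyond inspecting a single entry, the index can be scanned like any map-style dataset. The sketch below is illustrative only: it assumes the dataset supports len() and integer indexing (it follows the PyTorch Dataset format) and that the duration field is reported in seconds, which should be verified against the actual data.

from minedojo.data import YouTubeDataset

youtube_dataset = YouTubeDataset(full=False)  # tutorial subset for a quick scan

# Collect title/link pairs for videos longer than 10 minutes.
long_videos = []
for i in range(len(youtube_dataset)):
    entry = youtube_dataset[i]
    if entry["duration"] > 600:  # assumes 'duration' is in seconds
        long_videos.append((entry["title"], entry["link"]))

print(len(long_videos), "videos longer than 10 minutes")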
- class minedojo.data.wiki_dataset.WikiDataset(*, download=True, download_dir=None, full=True)[source]
Bases: object
Class for MineDojo Wiki Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general.
- Parameters
  - download (bool) – If True and there is no existing cache directory, the data will be downloaded automatically.
  - download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.
  - full (bool) – If True, the full version of the Wiki database will be downloaded. If False, only a sample version of the Wiki database will be downloaded. Default: True.
Examples
>>> from minedojo.data import WikiDataset
>>> wiki_dataset = WikiDataset()
>>> print(wiki_dataset[0].keys())
dict_keys(['metadata', 'tables', 'images', 'sprites', 'texts', 'screenshot'])
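As a rough illustration of inspecting a single Wiki entry, the sketch below relies only on the keys shown above; the internal structure of each field (for example, whether 'images' and 'tables' are lists) is an assumption to verify against the actual data.

from minedojo.data import WikiDataset

wiki_dataset = WikiDataset(full=False)  # sample version is enough for inspection

page = wiki_dataset[0]
print(page["metadata"])                             # page-level metadata
print(len(page["images"]), "images on this page")   # assumes 'images' is a list
print(len(page["tables"]), "tables on this page")   # assumes 'tables' is a list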
- class minedojo.data.reddit_dataset.RedditDataset(*, download=True, download_dir=None, client_id=None, client_secret=None, user_agent=None, max_comments=100)[source]
Bases: object
Class for MineDojo Reddit Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general. See https://praw.readthedocs.io/en/stable/getting_started/quick_start.html for setting up client_id, client_secret, and user_agent.
- Parameters
  - download (bool) – If True and there is no existing cache directory, the data will be downloaded automatically.
  - download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.
  - client_id (str) – The client ID to access Reddit’s API as a script application.
  - client_secret (str) – The client secret to access Reddit’s API as a script application.
  - user_agent (str) – A unique identifier that helps Reddit determine the source of network requests.
  - max_comments (int) – Maximum number of comments to load.
Examples
>>> from minedojo.data import RedditDataset
>>> reddit_dataset = RedditDataset(client_id={your_client_id}, client_secret={your_client_secret}, user_agent={your_user_agent})
>>> print(reddit_dataset[0].keys())
dict_keys(['id', 'title', 'link', 'score', 'num_comments', 'created_utc', 'type', 'content', 'comments'])
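A common pattern is to keep the PRAW credentials out of source code. The sketch below reads them from environment variables; the MINEDOJO_REDDIT_* names are placeholders chosen for this example, not part of the MineDojo API.

import os
from minedojo.data import RedditDataset

# Credentials come from a registered Reddit script application.
reddit_dataset = RedditDataset(
    client_id=os.environ["MINEDOJO_REDDIT_CLIENT_ID"],
    client_secret=os.environ["MINEDOJO_REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["MINEDOJO_REDDIT_USER_AGENT"],
    max_comments=50,  # cap the number of comments loaded per post
)

post = reddit_dataset[0]
print(post["title"], post["score"], post["num_comments"])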
- get_metadata(post_id, post_type)[source]
Get post metadata using PRAW.
- Parameters
  - post_id (str) – The base36 ID of the Reddit post to query.
  - post_type (str) – The type of the post, one of “image”, “text”, “video”, or “link”.
- Return type
dict
- Returns
  A dictionary containing the metadata of the post.
  - id (str) - The unique, base36 Reddit post ID.
  - title (str) - The title of the Reddit post.
  - link (str) - The URL of the Reddit post.
  - score (int) - The score of the Reddit post.
  - num_comments (int) - The number of comments under the Reddit post. Does not account for deleted comments.
  - created_utc (int) - The date and time the Reddit post was created, in UTC format.
  - type (str) - The type of the post, either “image”, “text”, “video”, or “link”.
  - content (str) - For text posts, the text of the post body; otherwise, the media source URL or website link.
  - comments (list[dict]) - The comments under the post, each a dictionary with:
    - id (str) - The unique base36 comment ID.
    - parent_id (str) - The ID of the comment’s parent in the nested comment tree.
    - content (str) - The text of the comment body.
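To illustrate how get_metadata might be combined with the cached entries, the sketch below refreshes a single post; it assumes the cached 'id' and 'type' fields are valid arguments for post_id and post_type, which should be checked against the actual data.

from minedojo.data import RedditDataset

reddit_dataset = RedditDataset(
    client_id="...", client_secret="...", user_agent="..."  # fill in your PRAW credentials
)

cached = reddit_dataset[0]
fresh = reddit_dataset.get_metadata(post_id=cached["id"], post_type=cached["type"])
print(fresh["score"], fresh["num_comments"])  # current score and comment count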