minedojo.data package

minedojo.data module

class minedojo.data.youtube_dataset.YouTubeDataset(*, download=True, download_dir=None, full=True)[source]

Bases: object

Class for MineDojo YouTube Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general.

Parameters
  • download (bool) – If set to True and there is no existing cache directory, the data will be downloaded automatically.

  • download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.

  • full (bool) – If True, the full version of the YouTube database will be downloaded. If False, only the tutorial version of the YouTube database will be downloaded. Default: True.

Examples

>>> from minedojo.data import YouTubeDataset
>>> youtube_dataset = YouTubeDataset()
>>> print(youtube_dataset[0].keys())
dict_keys(['id', 'title', 'link', 'view_count', 'like_count', 'duration', 'fps'])
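
Because the class follows the PyTorch Dataset convention (__getitem__ and, by assumption here, __len__) without inheriting from it, it can be wrapped in a real torch.utils.data.Dataset when a DataLoader is needed. The sketch below shows one way to do this; the wrapper class name and the choice of returned fields are illustrative, not part of the MineDojo API.

from torch.utils.data import Dataset

from minedojo.data import YouTubeDataset


class YouTubeTorchDataset(Dataset):
    # Thin wrapper so the MineDojo YouTube index can be used with a torch DataLoader.
    def __init__(self, **kwargs):
        self._db = YouTubeDataset(**kwargs)

    def __len__(self):
        # Assumes the underlying dataset exposes __len__ as part of its PyTorch-style interface.
        return len(self._db)

    def __getitem__(self, idx):
        entry = self._db[idx]
        # Return only a few fields for illustration; adjust to your needs.
        return {"id": entry["id"], "link": entry["link"], "duration": entry["duration"]}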
class minedojo.data.wiki_dataset.WikiDataset(*, download=True, download_dir=None, full=True)[source]

Bases: object

Class for MineDojo Wiki Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general.

Parameters
  • download (bool) – If True and there is no existing cache directory, the data will be downloaded automatically.

  • download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.

  • full (bool) – If True, the full version of the Wiki database will be downloaded. If False, only a sample version of the Wiki database will be downloaded. Default: True.

Examples

>>> from minedojo.data import WikiDataset
>>> wiki_dataset = WikiDataset()
>>> print(wiki_dataset[0].keys())
dict_keys(['metadata', 'tables', 'images', 'sprites', 'texts', 'screenshot'])
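
A quick way to explore the Wiki entries is to iterate over the dataset and inspect individual fields. The sketch below assumes the dataset implements __len__ as part of its PyTorch-style interface and that each entry is a dictionary with the keys shown above.

from minedojo.data import WikiDataset

wiki_dataset = WikiDataset(full=False)  # the sample version is enough for a quick look

pages_with_images = 0
for i in range(len(wiki_dataset)):
    page = wiki_dataset[i]
    if page.get("images"):  # count pages that carry at least one image
        pages_with_images += 1
print(f"{pages_with_images} of {len(wiki_dataset)} pages contain images")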
exception minedojo.data.reddit_dataset.RedditAPIKeyNotSpecifiedError[source]

Bases: Exception

class minedojo.data.reddit_dataset.RedditDataset(*, download=True, download_dir=None, client_id=None, client_secret=None, user_agent=None, max_comments=100)[source]

Bases: object

Class for MineDojo Reddit Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general. See https://praw.readthedocs.io/en/stable/getting_started/quick_start.html for instructions on setting up client_id, client_secret, and user_agent.

Parameters
  • download (bool) – If True and there is no existing cache directory, the data will be downloaded automatically.

  • download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.

  • client_id (str) – The client ID to access Reddit’s API as a script application.

  • client_secret (str) – The client secret to access Reddit’s API as a script application.

  • user_agent (str) – A unique identifier that helps Reddit determine the source of network requests.

  • max_comments (int) – Maximum number of comments to load.

Examples

>>> from minedojo.data import RedditDataset
>>> reddit_dataset = RedditDataset(client_id={your_client_id}, client_secret={your_client_secret}, user_agent={your_user_agent})
>>> print(reddit_dataset[0].keys())
dict_keys(['id', 'title', 'link', 'score', 'num_comments', 'created_utc', 'type', 'content', 'comments'])
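
The three credentials are the same ones PRAW expects for a script application (see the link above). A common pattern, sketched below, is to keep them out of source code and read them from environment variables; the variable names are arbitrary choices, not part of the MineDojo API.

import os

from minedojo.data import RedditDataset

reddit_dataset = RedditDataset(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent=os.environ.get("REDDIT_USER_AGENT", "minedojo-data-example"),
    max_comments=50,  # cap the number of comments fetched per post
)
post = reddit_dataset[0]
print(post["title"], post["num_comments"])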
get_comments(post)[source]

Return type

list[dict]

get_metadata(post_id, post_type)[source]

Get post metadata using PRAW.

Parameters
  • post_id (str) – The unique, base36 ID of a Reddit post.

  • post_type (str) – The type of the post, either “image”, “text”, “video” or “link”.

Return type

dict

Returns

A dictionary containing the metadata of the post.

  • id(str) - The unique, base36 Reddit post ID.

  • title(str) - The title of the Reddit post.

  • link(str) - The url of the Reddit post.

  • score(int) - The score of the Reddit post.

  • num_comments(int) - The number of comments under the Reddit post. Does not account for deleted comments.

  • created_utc(int) - The time the Reddit post was created, as a Unix timestamp in UTC.

  • type(str) - The type of the post, either “image”, “text”, “video” or “link”.

  • content(str) - If text type post, text in post body. Otherwise, the media source url or website link.

  • comments(list[dict]) - The comments under the post, each a dictionary containing:

    • id(str) - The unique base36 comment ID.

    • parent_id(str) - The ID of the comment’s parent in the nested comment tree.

    • content(str) - The text in comment body.
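
As a usage sketch, the snippet below calls get_metadata on a placeholder post ID and walks the returned dictionary. It reuses the reddit_dataset instance constructed in the example above; "abc123" is a hypothetical ID, not one taken from the dataset.

# Reuses the reddit_dataset instance from the constructor example above.
# "abc123" is a placeholder base36 post ID, not a real ID from the dataset.
meta = reddit_dataset.get_metadata(post_id="abc123", post_type="text")

print(meta["title"], meta["score"], meta["num_comments"])
for comment in meta["comments"]:
    print(comment["id"], "->", comment["content"][:80])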