minedojo.data package
minedojo.data module
- class minedojo.data.youtube_dataset.YouTubeDataset(*, download=True, download_dir=None, full=True)[source]
Bases: object
Class for MineDojo YouTube Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general.
- Parameters
  - download (bool) – If set to True and there is no existing cache directory, the data will be downloaded automatically.
  - download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.
  - full (bool) – If True, the full version of the YouTube database will be downloaded. If False, only the tutorial version of the YouTube database will be downloaded. Default: True.
Examples
>>> from minedojo.data import YouTubeDataset
>>> youtube_dataset = YouTubeDataset()
>>> print(youtube_dataset[0].keys())
dict_keys(['id', 'title', 'link', 'view_count', 'like_count', 'duration', 'fps'])
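Beyond inspecting a single entry, the index can be scanned like any map-style dataset. The sketch below is illustrative only: it assumes the dataset supports len() and integer indexing (it follows the PyTorch Dataset format) and that the duration field is reported in seconds, which should be verified against the actual data.

from minedojo.data import YouTubeDataset

youtube_dataset = YouTubeDataset(full=False)  # tutorial subset for a quick scan

# Collect title/link pairs for videos longer than 10 minutes.
long_videos = []
for i in range(len(youtube_dataset)):
    entry = youtube_dataset[i]
    if entry["duration"] > 600:  # assumes 'duration' is in seconds
        long_videos.append((entry["title"], entry["link"]))

print(len(long_videos), "videos longer than 10 minutes")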
- class minedojo.data.wiki_dataset.WikiDataset(*, download=True, download_dir=None, full=True)[source]
Bases: object
Class for MineDojo Wiki Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general.
- Parameters
  - download (bool) – If True and there is no existing cache directory, the data will be downloaded automatically.
  - download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.
  - full (bool) – If True, the full version of the Wiki database will be downloaded. If False, only a sample version of the Wiki database will be downloaded. Default: True.
Examples
>>> from minedojo.data import WikiDataset
>>> wiki_dataset = WikiDataset()
>>> print(wiki_dataset[0].keys())
dict_keys(['metadata', 'tables', 'images', 'sprites', 'texts', 'screenshot'])
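As a rough illustration of inspecting a single Wiki entry, the sketch below relies only on the keys shown above; the internal structure of each field (for example, whether 'images' and 'tables' are lists) is an assumption to verify against the actual data.

from minedojo.data import WikiDataset

wiki_dataset = WikiDataset(full=False)  # sample version is enough for inspection

page = wiki_dataset[0]
print(page["metadata"])                             # page-level metadata
print(len(page["images"]), "images on this page")   # assumes 'images' is a list
print(len(page["tables"]), "tables on this page")   # assumes 'tables' is a list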
- class minedojo.data.reddit_dataset.RedditDataset(*, download=True, download_dir=None, client_id=None, client_secret=None, user_agent=None, max_comments=100)[source]
Bases: object
Class for MineDojo Reddit Database API. We follow PyTorch Dataset format but without actually inheriting from PyTorch dataset to keep the framework general. See https://praw.readthedocs.io/en/stable/getting_started/quick_start.html for setting up client_id, client_secret, and user_agent.
- Parameters
  - download (bool) – If True and there is no existing cache directory, the data will be downloaded automatically.
  - download_dir (None | str) – Directory path where the downloaded data will be saved. Default: ~/.minedojo/.
  - client_id (str) – The client ID to access Reddit’s API as a script application.
  - client_secret (str) – The client secret to access Reddit’s API as a script application.
  - user_agent (str) – A unique identifier that helps Reddit determine the source of network requests.
  - max_comments (int) – Maximum number of comments to load.
Examples
>>> from minedojo.data import RedditDataset
>>> reddit_dataset = RedditDataset(client_id={your_client_id}, client_secret={your_client_secret}, user_agent={your_user_agent})
>>> print(reddit_dataset[0].keys())
dict_keys(['id', 'title', 'link', 'score', 'num_comments', 'created_utc', 'type', 'content', 'comments'])
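A common pattern is to keep the PRAW credentials out of source code. The sketch below reads them from environment variables; the MINEDOJO_REDDIT_* names are placeholders chosen for this example, not part of the MineDojo API.

import os
from minedojo.data import RedditDataset

# Credentials come from a registered Reddit script application.
reddit_dataset = RedditDataset(
    client_id=os.environ["MINEDOJO_REDDIT_CLIENT_ID"],
    client_secret=os.environ["MINEDOJO_REDDIT_CLIENT_SECRET"],
    user_agent=os.environ["MINEDOJO_REDDIT_USER_AGENT"],
    max_comments=50,  # cap the number of comments loaded per post
)

post = reddit_dataset[0]
print(post["title"], post["score"], post["num_comments"])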
- get_metadata(post_id, post_type)[source]
Get post metadata using PRAW.
- Parameters
  - post_id (str) – The base36 ID of the Reddit post to query.
  - post_type (str) – The type of the post, one of “image”, “text”, “video”, or “link”.
- Return type
dict
- Returns
  A dictionary containing the metadata of the post.
  - id (str) - The unique, base36 Reddit post ID.
  - title (str) - The title of the Reddit post.
  - link (str) - The URL of the Reddit post.
  - score (int) - The score of the Reddit post.
  - num_comments (int) - The number of comments under the Reddit post. Does not account for deleted comments.
  - created_utc (int) - The date and time the Reddit post was created, in UTC format.
  - type (str) - The type of the post, either “image”, “text”, “video”, or “link”.
  - content (str) - For text posts, the text of the post body; otherwise, the media source URL or website link.
  - comments (list[dict]) - The comments under the post, each a dictionary with:
    - id (str) - The unique base36 comment ID.
    - parent_id (str) - The ID of the comment’s parent in the nested comment tree.
    - content (str) - The text of the comment body.
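To illustrate how get_metadata might be combined with the cached entries, the sketch below refreshes a single post; it assumes the cached 'id' and 'type' fields are valid arguments for post_id and post_type, which should be checked against the actual data.

from minedojo.data import RedditDataset

reddit_dataset = RedditDataset(
    client_id="...", client_secret="...", user_agent="..."  # fill in your PRAW credentials
)

cached = reddit_dataset[0]
fresh = reddit_dataset.get_metadata(post_id=cached["id"], post_type=cached["type"])
print(fresh["score"], fresh["num_comments"])  # current score and comment count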