Reddit Scraper Tool

The Reddit Scraper Tool is a custom tool that allows the Newsletter AI Agent to gather discussions from Reddit using an Apify actor. It provides a way to collect community insights and discussions related to the specified topic.

Overview

The Reddit Scraper Tool is primarily used by the Researcher Agent to gather community discussions about the specified topic. It provides a flexible interface for searching Reddit and extracting structured data from posts and comments.

Implementation

The Reddit Scraper Tool is implemented as a CrewAI BaseTool that interacts with an Apify Reddit scraper actor. Here’s the implementation:

from crewai.tools import BaseTool
from pydantic import BaseModel, Field, ConfigDict
from typing import List, Optional, Literal
from apify import Actor
from src.tools.base import RunApifyActor

class RedditScraperInput(BaseModel):
    """Input schema for RedditScraper tool."""
    searches: List[str] = Field(
        description="Here you can provide a search query which will be used to search Reddit's topics."
    )

    startUrls: Optional[List[str]] = Field(
        description="If you already have URL(s) of page(s) you wish to scrape, you can set them here. If you want to use the search field below, remove all startUrls here.",
        default=None
    )
    
    skipComments: Optional[bool] = Field(
        default=False,
        description="This will skip scrapping comments when going through posts"
    )
    
    # Additional parameters...

class RedditScraperTool(BaseTool):
    name: str = "Reddit Scraper"
    description: str = "Tool for scraping Reddit content with configurable parameters"
    args_schema: type[BaseModel] = RedditScraperInput
    actor: Actor = Field(description="Apify Actor instance")
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
    def _run(
        self,
        searches: List[str],
        startUrls: Optional[List[str]] = None,
        skipComments: Optional[bool] = False,
        # Additional parameters...
    ) -> str:
        run_inputs = {}
        
        if searches:
            run_inputs["searches"] = searches
        if startUrls:
            run_inputs["startUrls"] = startUrls
        if skipComments:
            run_inputs["skipComments"] = skipComments
        # Set additional parameters...
        
        run_actor = RunApifyActor(self.actor)
        dataset = run_actor._run("reddit-scraper-actor-name", run_inputs)
        return dataset

Parameters

The Reddit Scraper Tool accepts the following parameters:

Parameter	Type	Description	Default
`searches`	List[str]	Search queries for Reddit topics	Required
`startUrls`	List[str]	Direct URLs to Reddit pages to scrape	None
`skipComments`	bool	Skip scraping comments when processing posts	False
`skipUserPosts`	bool	Skip scraping user posts when processing user activity	False
`skipCommunity`	bool	Skip scraping community info but still get community posts	False
`searchPosts`	bool	Search for posts with the provided search	True
`searchComments`	bool	Search for comments with the provided search	False
`searchCommunities`	bool	Search for communities with the provided search	False
`searchUsers`	bool	Search for users with the provided search	False
`sort`	str	How to sort the results (e.g., “new”, “top”, “hot”)	“new”
`time`	str	Time filter for results	None
`includeNSFW`	bool	Include NSFW content in results	True
`maxPostCount`	int	Maximum number of posts to retrieve	20
`maxComments`	int	Maximum number of comments to retrieve per post	20
`maxCommunitiesCount`	int	Maximum number of communities to retrieve	2
`maxUserCount`	int	Maximum number of users to retrieve	2

Usage

The Reddit Scraper Tool is used by the Researcher Agent to gather community discussions about the specified topic:

# Initialize the tool
reddit_tool = RedditScraperTool(actor=actor)

# Use the tool
reddit_results = reddit_tool._run(
    searches=[topic],
    searchPosts=True,
    searchComments=False,
    sort="relevance",
    maxPostCount=10
)

Return Value

The tool returns a list of Reddit posts and comments, where each item is a dictionary containing information about a post or comment, including:

title: The title of the post (for posts only)
text: The text content of the post or comment
url: The URL of the post or comment
author: The username of the author
score: The score (upvotes - downvotes) of the post or comment
created: The creation date of the post or comment
Additional metadata about the post or comment

Apify Integration

The tool uses an Apify Reddit scraper actor, which provides several advantages:

Scalability: The actor can handle large numbers of Reddit searches efficiently
Reliability: The actor is designed to handle rate limiting and other issues that can arise when scraping Reddit
Structured Data: The actor returns Reddit posts and comments in a structured format that is easy to process

Configuration

To use the Reddit Scraper Tool, you need to set up the following environment variables:

APIFY_API_KEY=your_apify_api_key_here

Next Steps

Learn about the Twitter Scraper Tool
Explore the Researcher Agent that uses this tool
See how this tool contributes to the newsletter generation process

Tools

​Reddit Scraper Tool

​Overview

​Implementation

​Parameters

​Usage

​Return Value

​Apify Integration

​Configuration

​Next Steps