Scraper function loaded from the Sharechat scrapers module. It runs when content_to_scrape="trending" is set in Config and scrapes content from the "trending" tab on the tag page. The scraped content will not be chronological.
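
For reference, here is a minimal sketch of the Config values this scraper depends on, based only on what is described on this page. content_to_scrape is the only variable named here, so the names used below for the tag list and page count are assumptions:

```python
# Sketch of the relevant Config values; variable names other than
# content_to_scrape are assumed for illustration.
content_to_scrape = "trending"                            # selects this scraper
sharechat_tag_hashes = ["<tag_hash_1>", "<tag_hash_2>"]   # tag hashes to scrape
pages = 2                                                 # trending pages to scrape per tag
```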

Scraper workflow:

The scraper performs the following steps using helper functions imported from the Sharechat helper and S3 Mongo helper modules. A simplified sketch of the overall flow is included after the list.

  1. Initializes S3 and Mongo DB connections. This is done first to ensure that any authentication errors are caught before the scraping begins.

  2. Calls get_trending_data(), which does the actual scraping as follows:

    1. Initializes an empty Pandas dataframe df with column labels corresponding to the content that will be scraped.

    2. Starts a loop that will scrape from each tag in the list of tag hashes entered in config.

      For each tag:

      1. Generates a requests dictionary with generate_requests_dict(). The returned dictionary contains the parameters required for replicating the Sharechat API requests described below.

      2. Sends a requestType66 request to get data about the tag.

      3. Scrapes the JSON response with a helper function called get_tag_data(). This returns tag_name, tag_translation, tag_genre, bucket_name and bucket_id.

      4. Starts a loop to scrape "n" pages of trending post data from the tag, where "n" is the number of pages entered in config

        For each page:

        1. Sends a getViralPostsSeo request using the helper function get_response_dict(). This returns a JSON response containing data about all the posts on the page (usually around 10).
        2. Scrapes the JSON response with a helper function called get_post_data(). This returns a dataframe called post_data containing the following metadata for each post: media_link, timestamp, language, media_type, external_shares, likes, comments, reposts, post_permalink, caption, text, views, profile_page.
          Note that some of the dataframe's column names differ from the metadata labels generated by Sharechat, e.g. 'usc' is renamed to 'external_shares'.
        3. Scrapes the JSON response with a helper function called get_next_offset_hash(). This returns a hash which is required to scrape the next page of the tag. The hash is included in the request sent by get_response_dict() in the next iteration of the loop.
        4. Appends the post_data to the main dataframe.
        5. Pauses for 30-35 seconds (a random time delay to avoid bombarding the Sharechat API with requests).
        6. Starts a loop to get additional content by content type (image, video or text). This is to compensate for the possibility that the content scraped so far is dominated by one content type.
          For each of these content types:
          1. Sends a requestType88 request to get data about all the posts under that type.
          2. Scrapes and appends the post data to the main dataframe in the same way as above.
    3. Drops duplicate posts (rows) from the main dataframe

    4. Transforms all the post timestamps with datetime.utcfromtimestamp() in accordance with Tattle's datetime conventions

    5. Adds a uuid filename to each post; this is used for identification across S3 and Mongo.

    6. Adds the scraped_date to each post

    7. Returns the main dataframe

  3. Uploads the scraped data to an S3 bucket with a helper function called sharechat_s3_upload() that uses the common s3_mongo_helper module. This function returns the dataframe with an **S3 URL** added to each post (row).

    If the S3 upload is successful, the scraper proceeds to step 4. If the S3 upload fails, the scraper jumps to step 8.

  4. Generates thumbnails for scraped images and videos saved on S3

  5. Creates and locally saves an HTML file containing the scraped content and thumbnails. This is handy for previewing and sharing the content.

  6. Uploads the scraped data including S3 urls to Mongo DB with a helper function called sharechat_mongo_upload() that uses the common s3_mongo_helper module

  7. Locally saves a CSV file containing the scraped content. This is handy for previewing, annotating and sharing the content.

  8. Returns the final complete dataframe containing all scraped data (posts and their metadata)
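
For orientation, the following is a minimal Python sketch of the flow described above, not the actual implementation. The helper functions (generate_requests_dict, get_response_dict, get_tag_data, get_post_data, get_next_offset_hash, sharechat_s3_upload, sharechat_mongo_upload) are the ones named on this page, but their exact signatures, the module name "sharechat_helper", the Config-derived arguments and the dataframe column name used for the uuid are assumptions made for illustration.

```python
# Minimal sketch of get_trending_data(); helper signatures are assumptions.
import random
import time
import uuid
from datetime import datetime

import pandas as pd

# "s3_mongo_helper" is named on this page; "sharechat_helper" and the
# exact signatures are assumptions.
from sharechat_helper import (generate_requests_dict, get_response_dict,
                              get_tag_data, get_post_data,
                              get_next_offset_hash)


def get_trending_data(tag_hashes, pages):
    # 2.1 Empty dataframe with the columns that will be scraped
    df = pd.DataFrame(columns=[
        "media_link", "timestamp", "language", "media_type",
        "external_shares", "likes", "comments", "reposts",
        "post_permalink", "caption", "text", "views", "profile_page"])

    for tag_hash in tag_hashes:                               # 2.2 one pass per tag
        requests_dict = generate_requests_dict(tag_hash)      # API request parameters
        # 2.2.2-2.2.3 requestType66 -> tag metadata (request mechanics assumed)
        tag_response = get_response_dict(requests_dict, request_type="requestType66")
        tag_name, tag_translation, tag_genre, bucket_name, bucket_id = \
            get_tag_data(tag_response)

        offset_hash = None
        for _ in range(pages):                                # 2.2.4 "n" trending pages
            response = get_response_dict(requests_dict,
                                         request_type="getViralPostsSeo",
                                         offset_hash=offset_hash)  # ~10 posts per page
            post_data = get_post_data(response)               # 'usc' -> 'external_shares' etc.
            offset_hash = get_next_offset_hash(response)      # needed for the next page
            df = pd.concat([df, post_data], ignore_index=True)
            time.sleep(random.uniform(30, 35))                # don't bombard the API

            for content_type in ("image", "video", "text"):   # top up each content type
                extra = get_response_dict(requests_dict,
                                          request_type="requestType88",
                                          content_type=content_type)
                df = pd.concat([df, get_post_data(extra)], ignore_index=True)

    df = df.drop_duplicates()                                          # 2.3 remove duplicate posts
    df["timestamp"] = df["timestamp"].apply(
        lambda t: datetime.utcfromtimestamp(int(t)))                   # 2.4 Tattle datetime convention
    df["filename"] = [uuid.uuid4().hex for _ in range(len(df))]       # 2.5 uuid (column name assumed)
    df["scraped_date"] = datetime.utcnow()                            # 2.6
    return df                                                          # 2.7
```

The surrounding workflow (steps 1 and 3-8) can be sketched the same way. Apart from sharechat_s3_upload() and sharechat_mongo_upload(), which are named above, the function names and the CSV filename here are stand-ins for the real helpers:

```python
# Sketch of the outer workflow; initialization, thumbnail and HTML helper
# names and the CSV filename are assumptions.
from s3_mongo_helper import initialize_s3, initialize_mongo   # assumed names
from sharechat_helper import sharechat_s3_upload, sharechat_mongo_upload


def scrape_trending_content(tag_hashes, pages):
    initialize_s3()                                   # 1. fail fast on auth errors
    initialize_mongo()
    df = get_trending_data(tag_hashes, pages)         # 2. scrape
    try:
        df = sharechat_s3_upload(df)                  # 3. adds an S3 URL to each row
    except Exception:
        return df                                     # S3 upload failed: skip to step 8
    generate_thumbnails(df)                           # 4. thumbnails for images/videos on S3
    save_html_preview(df)                             # 5. local HTML preview
    sharechat_mongo_upload(df)                        # 6. upload to Mongo DB
    df.to_csv("sharechat_trending_content.csv", index=False)   # 7. local CSV
    return df                                         # 8. full dataframe
```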