post validation, timeformat, data dir, documentation
Saikamesh committed Feb 16, 2024
1 parent dba6ae1 commit ea38ccd
Showing 6 changed files with 167 additions and 16 deletions.
12 changes: 11 additions & 1 deletion CHANGELOG.txt
Expand Up @@ -24,4 +24,14 @@
[Version 0.0.24] - 2024-01-21
- Fix: Video Id & Music Id will no longer be affected by excel file conversion.
- Added: CSV file will now contain a column with the search_key & date on which the query was made.
- Code refactoring & structural changes.

[Version 0.0.25] - 2024-01-27
- Change: create_time now shows standard time format instead of unix timestamp.

[Version 0.0.26] - 2024-02-02
- Added: TikTok posts containing images are no longer returned.
- Change: Individual csv files are now saved in a subfolder inside Data/root. Merged file will stay in Data/root Folder.

[Version 0.0.27] - 2024-02-12
- Added: Documentation for structuring query object
4 changes: 2 additions & 2 deletions README.md
Expand Up @@ -18,10 +18,10 @@ pip install -i https://test.pypi.org/simple/ tt-crawl

# Instructions

To learn how to construct your own query, use the [tiktok documentation](https://developers.tiktok.com/doc/research-api-specs-query-videos/)
To learn how to construct your own query, see [here](./howToQuery.md) or refer to the [tiktok documentation](https://developers.tiktok.com/doc/research-api-specs-query-videos/)


Perform a query
Performing a query


```
126 changes: 126 additions & 0 deletions howToQuery.md
@@ -0,0 +1,126 @@
## TikTok Research API

How do you structure a request body?

Below is an example of a request body.

```python
request_body = {
    "query": {
        "and": [
            {"operation": "IN", "field_name": "region_code", "field_values": ["JP", "US"]},
            {"operation": "EQ", "field_name": "hashtag_name", "field_values": ["Valorant"]},
        ],
        "or": [
            {"operation": "EQ", "field_name": "video_length", "field_values": ["MID"]},
            {"operation": "EQ", "field_name": "video_length", "field_values": ["LONG"]},
        ],
        "not": [
            {"operation": "EQ", "field_name": "video_length", "field_values": ["SHORT"]}
        ],
    }
}
```

The request body wraps a query object, which tells the API which videos to return.

Every query object has three children: `and`, `or`, and `not`. Each of them is a list of conditions.

The `and` list requires every condition in it to be met.

The `or` list requires at least one condition in it to be met.

The `not` list requires that no condition in it is met.

A valid query must contain at least one non-empty `and`, `or`, or `not` list.

A condition is an object that specifies the field name, the operation, and the field values used to restrict the query.
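
The rules above can be sketched as two small helpers. The names `make_condition` and `make_query` are hypothetical and not part of tt_crawl; the sketch only assembles the dictionary and rejects a query whose three lists are all empty.

```python
VALID_OPERATIONS = {"EQ", "IN", "GT", "GTE", "LT", "LTE"}


def make_condition(operation, field_name, field_values):
    """Return one condition dict (hypothetical helper, not part of tt_crawl)."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return {
        "operation": operation,
        "field_name": field_name,
        "field_values": list(field_values),
    }


def make_query(and_=(), or_=(), not_=()):
    """Assemble a request body; at least one condition list must be non-empty."""
    query = {"and": list(and_), "or": list(or_), "not": list(not_)}
    if not any(query.values()):
        raise ValueError("a query needs at least one non-empty and/or/not list")
    return {"query": query}


# Usage: hashtag must match, short videos are excluded, `or` stays empty.
request_body = make_query(
    and_=[make_condition("EQ", "hashtag_name", ["nfl"])],
    not_=[make_condition("EQ", "video_length", ["SHORT"])],
)
```

Building conditions through one function keeps the three keys spelled consistently, which is where hand-written query dictionaries tend to go wrong.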

### operation:

---

**Possible values**: "EQ", "IN", "GT", "GTE", "LT", "LTE"

**Value Descriptions:**

**EQ**: equal to

**IN**: equal to any one of the listed values

**GT**: greater than

**LT**: less than

**GTE**: greater than or equal to

**LTE**: less than or equal to
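
To make the semantics concrete, the sketch below evaluates a query against a plain dictionary on the local machine. It is an illustration only: the real filtering happens on the API side, `matches` and `query_matches` are hypothetical names, and two points are assumptions rather than documented behavior: that the scalar operations compare against the first entry of `field_values`, and that an empty `or` list imposes no restriction.

```python
import operator

# Scalar operations mapped to Python comparisons (local illustration only).
_SCALAR_OPS = {
    "EQ": operator.eq,
    "GT": operator.gt,
    "GTE": operator.ge,
    "LT": operator.lt,
    "LTE": operator.le,
}


def matches(condition, record):
    """Check a single condition against a record dict."""
    value = record[condition["field_name"]]
    values = condition["field_values"]
    if condition["operation"] == "IN":
        return value in values
    # Assumption: scalar operations compare against the first field value.
    return _SCALAR_OPS[condition["operation"]](value, values[0])


def query_matches(query, record):
    """Apply and/or/not semantics; an empty list is treated as 'no restriction'."""
    and_ok = all(matches(c, record) for c in query.get("and", []))
    or_list = query.get("or", [])
    or_ok = not or_list or any(matches(c, record) for c in or_list)
    not_ok = not any(matches(c, record) for c in query.get("not", []))
    return and_ok and or_ok and not_ok
```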

### field_name & field_values:

---

Each field_name accepts its own kind of field_values, so pick values that match the field you chose.

Refer to the table below.

| Field Name | Description | Example |
| ------------ | --------------------------------------------------------------------------------- | ------------------------------------ |
| create_date | The video creation date in UTC, presented in the format YYYYMMDD | 20220910 |
| username | The username of the video creator | "cookie_love_122" |
| region_code  | A two-letter code for the country where the creator registered their account      | "US", "UK", "IN", "JP", ...          |
| video_id | The unique identifier of the video | 6978662169214864645 |
| hashtag_name | The hashtag associated with the video | "arianagrande", "celebrity" |
| keyword | The keyword in the video description | "tiktok" |
| music_id | The music ID of the video. | 8978345345214861235 |
| effect_id | The effect ID of the video. | 3957392342148643476 |
| video_length | The duration of the video | "SHORT", "MID", "LONG", "EXTRA_LONG" |

`SHORT: <15s,
MID: 15s~1min,
LONG: 1~5min,
EXTRA_LONG: >5min`
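
As a rough illustration of these buckets, the following sketch maps a duration in seconds to a `video_length` value. The function name is hypothetical, and the handling of the exact boundaries (60 s and 300 s) is an assumption, since the ranges above do not say which bucket they fall into.

```python
def video_length_bucket(duration_seconds):
    """Map a duration in seconds to SHORT / MID / LONG / EXTRA_LONG."""
    if duration_seconds < 15:
        return "SHORT"
    # Assumption: exactly 60 s still counts as MID, exactly 300 s as LONG.
    if duration_seconds <= 60:
        return "MID"
    if duration_seconds <= 300:
        return "LONG"
    return "EXTRA_LONG"
```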

---

### Example

Below is an example of how to write a query object for a given set of requirements.

Say you want data about videos that carry the hashtag #nfl and come from the US region, and you want to exclude videos shorter than 15 seconds. The query would look like this:

```python
request_body = {
    "query": {
        "and": [
            {"operation": "IN", "field_name": "region_code", "field_values": ["US"]},
            {"operation": "EQ", "field_name": "hashtag_name", "field_values": ["nfl"]},
        ],
        "or": [],
        "not": [
            {"operation": "EQ", "field_name": "video_length", "field_values": ["SHORT"]}
        ],
    }
}
```

We have left the `or` condition empty as there is no need for it.

Now suppose we need data from either the US or Japan. The query object would then be structured as follows:

```python
request_body = {
    "query": {
        "and": [
            {"operation": "EQ", "field_name": "hashtag_name", "field_values": ["nfl"]},
        ],
        "or": [
            {"operation": "IN", "field_name": "region_code", "field_values": ["US"]},
            {"operation": "IN", "field_name": "region_code", "field_values": ["JP"]}
        ],
        "not": [
            {"operation": "EQ", "field_name": "video_length", "field_values": ["SHORT"]}
        ],
    }
}
```
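
The `create_date` field from the table can be restricted in the same way. The sketch below builds a pair of `GTE`/`LTE` conditions for an inclusive date range. The helper name is hypothetical, and passing the dates as `YYYYMMDD` strings is an assumption based on the format given in the table; check the TikTok documentation before relying on it.

```python
import datetime


def date_range_conditions(start, end):
    """Return GTE/LTE conditions on create_date for an inclusive range."""
    return [
        {
            "operation": "GTE",
            "field_name": "create_date",
            # Assumption: dates are sent as YYYYMMDD strings.
            "field_values": [start.strftime("%Y%m%d")],
        },
        {
            "operation": "LTE",
            "field_name": "create_date",
            "field_values": [end.strftime("%Y%m%d")],
        },
    ]


# Usage: #nfl videos created in January 2024.
request_body = {
    "query": {
        "and": [
            {"operation": "EQ", "field_name": "hashtag_name", "field_values": ["nfl"]},
            *date_range_conditions(datetime.date(2024, 1, 1), datetime.date(2024, 1, 31)),
        ],
    }
}
```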
4 changes: 2 additions & 2 deletions setup.py
Expand Up @@ -11,7 +11,7 @@

setup(
name="tt_crawl",
version="0.0.24",
version="0.0.27",
description="A TikTok crawler",
long_description=open("README.md").read() + "\n\n" + open("CHANGELOG.txt").read(),
long_description_content_type="text/markdown",
Expand All @@ -21,6 +21,6 @@
classifiers=classifiers,
keywords="TikTok, TikTok Research API, TikTok Data",
packages=find_packages(),
install_requires=["requests==2.31.0", ],
python_requires=">=3.10",
install_requires=["requests==2.31.0", ],
)
24 changes: 13 additions & 11 deletions tt_crawl/tt_crawler.py
Expand Up @@ -7,7 +7,7 @@
import re
from typing import Union
from . import utils as ut

from . import validation as vl

class TikTokCrawler:
OAUTH_URL = "https://open.tiktokapis.com/v2/oauth/token/"
Expand Down Expand Up @@ -104,10 +104,10 @@ def _process_request(
return err
else:
response_json = response.json()
response_json["search_key"] = search_key
response_json["queried_date"] = queried_date
return response_json
pass
res_json = vl.validate_urls(response_json)
res_json["search_key"] = search_key
res_json["queried_date"] = queried_date
return res_json

def query_videos(
self,
Expand Down Expand Up @@ -164,7 +164,7 @@ def make_csv(
fields = self.FIELDS.split(",") + ["search_key", "queried_date"]

if not data_dir:
data_dir = os.path.join(os.getcwd(), "Data")
data_dir = os.path.join(os.getcwd(), "Data", "video_data")
os.makedirs(data_dir, exist_ok=True)

if not isinstance(data, list):
Expand Down Expand Up @@ -193,16 +193,18 @@ def merge_all_data(self, data_dir: str = None, file_name: str = None) -> None:
Merges all the csv files in the Data folder.
"""
if not data_dir:
data_dir = os.path.join(os.getcwd(), "Data")
data_dir = os.path.join(os.getcwd(), "Data", "video_data")
if not file_name:
file_name = (
f"merged_data_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
)
file_name = "video_list.csv"
# file_name = (
# f"merged_data_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
# )

all_files = glob.glob(os.path.join(data_dir, "*.csv"))
file_path = os.path.join(os.getcwd(), "Data", file_name)

with open(
os.path.join(data_dir, file_name), "w", newline="", encoding="utf-8"
os.path.join(file_path), "a", newline="", encoding="utf-8"
) as fout:
writer = csv.writer(fout)
header_saved = False
13 changes: 13 additions & 0 deletions tt_crawl/validation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
import requests


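def validate_urls(response_json: dict) -> dict:
    """Drop videos whose embed page responds with HTTP 400."""
    EMBED_URL = "https://www.tiktok.com/embed/"

    # Build a new list instead of calling remove() while iterating,
    # which would skip the element that follows each removed one.
    videos = response_json["data"]["videos"]
    response_json["data"]["videos"] = [
        video
        for video in videos
        if requests.get(EMBED_URL + str(video["id"])).status_code != 400
    ]

    return response_json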
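
Removing items from a list while looping over it skips the element after each removal, so filtering into a new list is the safer pattern. The sketch below shows that pattern with the HTTP lookup passed in as a parameter, which also lets it run without network access. `filter_videos`, `get_status`, and the video IDs are hypothetical and not part of tt_crawl.

```python
from typing import Callable

EMBED_URL = "https://www.tiktok.com/embed/"


def filter_videos(videos: list, get_status: Callable[[str], int]) -> list:
    """Keep only the videos whose embed URL does not answer with HTTP 400."""
    return [v for v in videos if get_status(EMBED_URL + str(v["id"])) != 400]


# Offline usage with a stubbed status lookup; two consecutive 400s are both dropped.
fake_status = {EMBED_URL + "1": 200, EMBED_URL + "2": 400, EMBED_URL + "3": 400}
kept = filter_videos([{"id": 1}, {"id": 2}, {"id": 3}], fake_status.__getitem__)
print(kept)  # [{'id': 1}]
```

With real requests, `get_status` could be something like `lambda url: requests.get(url, timeout=10).status_code`; the timeout value is only a suggestion.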
