12.8.23
Objectives
- To collect data from online platforms lacking public APIs. This entails documenting some of Selenium’s basic library functions.
Key Concepts
- We use Grailed as an example of the marketplaces from which I’ll scrape data, later indexing it in a buying and reselling program I’m developing.
Tools and Techniques
- Selenium, a browser automation tool, facilitates user interaction emulation. It’s the holy grail for developing web bots.
- Pandas, an open-source Python library, excels in data analytics and manipulation.
- Wappalyzer, a browser extension, identifies technology stacks on websites. Useful for understanding web page structures and anticipating anomalies.
Practical Exercises
- Authentication: This exercise demonstrates logging in as a user while avoiding detection by JavaScript trackers. To keep credentials out of the source code, we’ll create a .env file in our project’s root directory. Add this file to your .gitignore before uploading source code; committing it would expose your credentials.
Create the .env file and add environment variables:
GRAILED_USERNAME="username_goes_here"
GRAILED_PASSWORD="password_goes_here"
Create “auth.py” and import dependencies and modules. Create variables for program access.
import os
from dotenv import load_dotenv
# Env Variables
load_dotenv('.env')
username = os.getenv('GRAILED_USERNAME')
password = os.getenv('GRAILED_PASSWORD')
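os.getenv() returns None when a variable is missing, so a typo in the .env file would only surface later as a failed login. A minimal sketch of a fail-fast check, assuming a hypothetical helper named require_env (my own name, not part of python-dotenv):

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Hypothetical usage in auth.py, after load_dotenv('.env') has run:
# username = require_env('GRAILED_USERNAME')
# password = require_env('GRAILED_PASSWORD')
```

Raising early keeps the error message close to its cause instead of burying it in a Selenium timeout.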
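Functions are defined below. For brevity, the imports are omitted from the listings; the code relies on webdriver (from selenium), By (selenium.webdriver.common.by), WebDriverWait (selenium.webdriver.support.ui), expected_conditions imported as EC (selenium.webdriver.support), TimeoutException (selenium.common.exceptions), ActionChains (selenium.webdriver.common.action_chains), Keys (selenium.webdriver.common.keys), and the standard time module. Most IDEs will offer to add these for you.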
# Initialise a Chrome driver; pass headless=True to run without a visible window
def init_driver(headless=False):
    chrome_options = webdriver.ChromeOptions()
    if headless:
        chrome_options.add_argument('--headless')
    return webdriver.Chrome(options=chrome_options)
The driver, the interface for browser interactions, can be any browser driver. Chrome is used here for its popularity.
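To illustrate that the browser is interchangeable, here is a rough sketch of a factory that returns either a Chrome or a Firefox driver. The function name init_browser_driver and its browser parameter are my own assumptions and are not used elsewhere in these notes; running it requires Selenium plus the matching browser to be installed.

```python
def init_browser_driver(browser="chrome", headless=False):
    """Return a Selenium driver for the named browser (hypothetical helper)."""
    browser = browser.lower()
    if browser not in ("chrome", "firefox"):
        raise ValueError(f"Unsupported browser: {browser}")

    # Imported here so the argument check above works even without Selenium installed.
    from selenium import webdriver

    if browser == "chrome":
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless")
        return webdriver.Chrome(options=options)

    options = webdriver.FirefoxOptions()
    if headless:
        options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

The rest of the login code only talks to the returned driver object, so it should not need to know which browser is behind it.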
In the login_grailed() function, we target elements for actions. Initially, I used time.sleep() but found a native Selenium function more universally applicable.
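EC.element_to_be_clickable(): An expected condition that, when passed to WebDriverWait(...).until(), makes the script wait until the element is visible and enabled before acting on it. If that doesn’t happen within the timeout, a TimeoutException is raised.
Using try/except blocks allows for easier debugging through exception messages in the console.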
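The difference from time.sleep() is that an explicit wait polls: it checks a condition repeatedly and returns as soon as the condition holds, giving up only after the timeout. The sketch below shows that idea in plain Python. It is a simplified stand-in of my own, not Selenium's actual implementation, and the name wait_until and its parameters are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a fixed time.sleep(10) the script always pays the full ten seconds; with polling it pays only as long as the page actually takes, which is why the explicit wait generalises better across slow and fast page loads.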
def login_grailed(url, driver):
    sign_in_url = url.rstrip('/') + "/users/sign_up/"
    driver.get(sign_in_url)

    try:
        # Wait until the login link is clickable, then click it
        wait = WebDriverWait(driver, 10)
        login_link = wait.until(EC.element_to_be_clickable((By.XPATH, "//a[@href='/users/sign_up' and text()='Login']")))
        login_link.click()
    except TimeoutException:
        print("Login link not available")
        driver.quit()
        return

    try:
        # Click on the login button
        login_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Log in with Email']")))
        login_button.click()
    except TimeoutException:
        print("Login button not available")
        driver.quit()
        return

    try:
        # Input username one character at a time, pausing between keystrokes
        actions = ActionChains(driver)
        email_input_field = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'email')))
        actions.move_to_element(email_input_field).click().perform()
        for character in username:
            actions.send_keys(character).perform()
            time.sleep(0.15)
    except TimeoutException:
        print("Email input field not available")
        driver.quit()
        return

    try:
        # Input password one character at a time, pausing between keystrokes
        password_input_field = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'password')))
        actions.move_to_element(password_input_field).click().perform()
        for character in password:
            actions.send_keys(character).perform()
            time.sleep(0.012)
    except TimeoutException:
        print("Password input field not available")
        driver.quit()
        return

    try:
        # Submit login
        time.sleep(3)
        actions.send_keys(Keys.RETURN).perform()
        time.sleep(2)
        # Add check for successful login here
    except Exception as e:
        print("Error submitting login: ", e)
        driver.quit()
        return
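The fixed pauses between keystrokes above are perfectly regular, which is itself a pattern. One possible variation is to randomise each pause slightly. The sketch below is an untested assumption on my part (I have not checked whether it changes how Grailed treats the script), and type_like_human, send_key, and jitter are hypothetical names.

```python
import random
import time


def type_like_human(send_key, text, base_delay=0.15, jitter=0.05, sleep=time.sleep):
    """Send text one character at a time, pausing a slightly random interval after each."""
    for character in text:
        send_key(character)
        # Vary each pause within base_delay +/- jitter, never going below zero.
        sleep(max(0.0, random.uniform(base_delay - jitter, base_delay + jitter)))


# Hypothetical usage inside login_grailed(), once the field has been clicked:
# type_like_human(lambda c: actions.send_keys(c).perform(), username)
```

Passing the key-sending step in as a callable also removes the duplicated loop for the username and password fields.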
Key Takeaways
- Secure handling of environment variables for credential passing.
- Circumventing JS trackers by filling in forms the way a person would: clicking into each field and typing one character at a time with short pauses.
- Using and saving cookie files to reduce processing time.
- Identifying elements created dynamically by JS frameworks, and making the code more robust by targeting them with specific XPath expressions.
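The cookie takeaway is not shown in the exercise above, so here is a minimal sketch of how it could look. It relies on Selenium's driver.get_cookies() and driver.add_cookie(); the helper names save_cookies and load_cookies and the JSON file format are my own assumptions. Note that Selenium only accepts a cookie while the browser is on a page of the matching domain, so the site has to be opened before load_cookies() is called.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    """Write the driver's current cookies to a JSON file."""
    Path(path).write_text(json.dumps(driver.get_cookies()))


def load_cookies(driver, path):
    """Add cookies from a JSON file to the driver; return False if the file is missing."""
    cookie_file = Path(path)
    if not cookie_file.exists():
        return False
    for cookie in json.loads(cookie_file.read_text()):
        driver.add_cookie(cookie)
    return True
```

A saved session can expire, so a False return value (or a page that still shows the login link after reloading) should fall back to the full login flow. Like the .env file, the cookie file holds session secrets and belongs in .gitignore.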
Questions and Curiosities
- Are there no-code frameworks that can record a user’s behavior and replay it as a script? Worth exploring.
Additional Resources
- Selenium Avoid Bot Detection: Tricks to go unnoticed despite bots being easily detected today.
- Hackernoon: The Web Scraping Anti-Bot Matrix Guide
Personal Reflection
- The ease of deceiving webpages into treating my script as a regular user was surprising.
- Writing basic manipulation code is time-consuming; I aspire to refactor this into modular form, essential for the various marketplaces in my project.