14
submitted 1 month ago by [email protected] to c/[email protected]
  • I was applying to a job, and then I had to answer a question about web scraping, which I'm not familiar with. I answered all the other questions with no issue, so I decided might as well put in the effort to learn the basics and see if I can do it in a day.
  • Yes, it was *somewhat * easier than I expected, but I still had to watch like 4 YouTube videos and read a bunch of reddit and stack overflow posts.
  • I got the code working, but I decided to run it again to double-check. It stopped working. Not sure why.
  • Testing is also annoying because the "web page" is a google doc and constantly reloads or something. It takes forever to get proper results from my print statements.
  • I attached an image with the question. I haven't heard back from them, and I've seen other people post what I think might be this exact question online, so hopefully I'm not doing anything illegal.
  • At this point, I just want to solve it. Here's the code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

def createDataframe(url): #Make the data easier to handle
    #Get the page's html data using BeautifulSoup
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    #Extract the table's headers and column structure
    table_headers = soup.find('tr', class_='c8')
    table_headers_titles = table_headers.find_all('td')
    headers = [header.text for header in table_headers_titles]

    #Extract the table's row data
    rows = soup.find_all('tr', class_='c4')
    row_data_outer = [row.find_all('td') for row in rows]
    row_data = [[cell.text.strip() for cell in row] for row in row_data_outer]

    #Create a dataframe using the extracted data
    df = pd.DataFrame(row_data, columns=headers)
    return df

def printMessage(dataframe): #Print the message gotten from the organised data
    #Drop rows that have missing coordinates
    dataframe = dataframe.dropna(subset=['x-coordinate', 'y-coordinate'], inplace=True)

    #Convert the coordinate columns to integers so they can be used
    dataframe['x-coordinate'] = dataframe['x-coordinate'].astype(int)
    dataframe['y-coordinate'] = dataframe['y-coordinate'].astype(int)

    #Determine how large the grid to be printed is
    max_x = int(dataframe['x-coordinate'].max())
    max_y = int(dataframe['y-coordinate'].max())

    #Create an empty grid
    grid = np.full((max_y + 1, max_x + 1), " ")

    #Fill the grid with the characters using coordinates as the indices
    for _, row in dataframe.iterrows():
        x = row['x-coordinate']
        y = row['y-coordinate']
        char = row['Character']
        grid[y][x] = char
    for row in grid:
        print("".join(row))

test = 'https://docs.google.com/document/d/e/2PACX-1vQGUck9HIFCyezsrBSnmENk5ieJuYwpt7YHYEzeNJkIb9OSDdx-ov2nRNReKQyey-cwJOoEKUhLmN9z/pub'
printMessage(createDataframe(test))

My most recent error:

C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\Scripts\python.exe C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\app.py 
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\app.py", line 50, in <module>
    printMessage(createDataframe(test))
  File "C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\app.py", line 30, in printMessage
    dataframe['x-coordinate'] = dataframe['x-coordinate'].astype(int)
                                ~~~~~~~~~^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

Process finished with exit code 1

you are viewing a single comment's thread
view the rest of the comments
[-] [email protected] 5 points 1 month ago* (last edited 1 month ago)

Your dataframe is None because you're dropping the rows in place. Either set inplace=False in that dropna call, or remove the reassignment to dataframe.

[-] [email protected] 2 points 1 month ago

Thanks, I set inplace=False like you suggested. I thought setting it to true meant that it modified the original dataframe. Why does it work with false?

[-] [email protected] 4 points 1 month ago

Inplace=true will modify the original dataframe, while false will return the result in a new dataframe. When you assign a variable the result of in place=true (dataframe = dataframe...(I place=true)), you've overwritten the data frame with the result of that call (which is nothing, it happened in place)

load more comments (1 replies)
this post was submitted on 01 May 2025
14 points (88.9% liked)

Python

7121 readers
5 users here now

Welcome to the Python community on the programming.dev Lemmy instance!

📅 Events

PastNovember 2023

October 2023

July 2023

August 2023

September 2023

🐍 Python project:
💓 Python Community:
✨ Python Ecosystem:
🌌 Fediverse
Communities
Projects
Feeds

founded 2 years ago
MODERATORS