Scraping dynamic pages with Python, Playwright and AWS Lambda

If you have ever pointed BeautifulSoup at a modern job board and then wondered why you got only a fraction of the visible listings, welcome to the club. Many of these pages behave like mini frontends: data appears in chunks, the DOM keeps changing, and scrolling is effectively part of the API contract. For this walkthrough, I used the Dev IT Jobs portal as a practical example. This post breaks down a Lambda scraper that survives that behavior. The idea is simple but battle-tested: use Playwright + headless Chromium to trigger dynamic loading, extract records while scrolling, shape the result with Polars, and store snapshots as parquet in S3 partitions. It is serverless, schedule-friendly, and ready for downstream analytics without extra cleanup.

Imports and runtime setup

Packages used in the Lambda:

  • playwright: runs a Chromium browser so JavaScript-rendered cards can be collected.
  • boto3: uploads the final parquet artifact to S3.
  • polars: converts raw records into a dataframe and writes parquet efficiently.
  • pendulum: provides cleaner timestamp handling for metadata and S3 partition keys.
  • aws_lambda_typing: adds explicit types for the Lambda handler contract (optional).

Standard-library helpers:

  • logging: emits structured runtime logs for CloudWatch.
  • os: reads environment configuration such as BUCKET_URL.
  • tempfile: writes temporary files to Lambda’s /tmp storage.
  • time: adds short pauses so lazy-loaded DOM elements can render.
  • urllib.parse: parses the bucket name from URL-like configuration values.
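
As a small illustration of the urllib.parse helper, here is a minimal sketch of pulling the bucket name out of a URL-like setting. The function name bucket_from_url and the fallback value s3://my-example-bucket are assumptions made for this example; only the BUCKET_URL variable name comes from the setup above.

```python
import os
from urllib.parse import urlparse


def bucket_from_url(bucket_url: str) -> str:
    """Return the bucket name from a URL-like value such as s3://bucket/prefix."""
    bucket = urlparse(bucket_url).netloc
    if not bucket:
        # A bare name like "my-bucket" has no scheme, so netloc is empty.
        raise ValueError(f"Expected a URL like s3://bucket, got {bucket_url!r}")
    return bucket


# Hypothetical fallback so the sketch also runs outside Lambda.
bucket_name = bucket_from_url(os.environ.get("BUCKET_URL", "s3://my-example-bucket"))
```

Failing loudly on a value without a scheme is a design choice of this sketch: an empty bucket name would otherwise only surface later as a confusing upload error.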

Opening the page and targeting the scrollable list

The first step is launching Chromium in headless mode and identifying the actual element that reacts to scroll events. On this page, .joblist-container is where new cards are appended, so scrolling the whole page does not reliably pull in the full dataset.

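from playwright.sync_api import sync_playwright

# url is the listing page passed in by the caller.
with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-gpu",
            "--disable-dev-shm-usage",
            "--disable-setuid-sandbox",
            "--no-sandbox",
            "--single-process",
        ],
    )
    page = browser.new_page()
    # Wait until network activity settles so the first batch of cards is rendered.
    page.goto(url, wait_until="networkidle")
    # The element that actually receives scroll events, not the whole page.
    container = page.locator(".joblist-container").first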

Those Chromium flags are not nice-to-have tuning. Playwright needs them to run reliably in AWS Lambda:

  • --disable-gpu: skips hardware-acceleration paths that do not help here.
  • --disable-dev-shm-usage: steers Chromium away from shared memory (/dev/shm), which can be too tight in serverless containers.
  • --disable-setuid-sandbox and --no-sandbox: help when sandbox initialization fails in restricted environments.
  • --single-process: also reduced startup flakiness in my runs.

Without this set, the function was far more likely to fail before scraping anything useful.

Scraping records while the page reveals more items

The extraction loop does one boring but powerful thing on repeat: wait a moment, read visible li cards, save what matters, scroll, and repeat. Dynamic pages frequently repaint old nodes, so handled_jobs is a must-have to avoid collecting duplicates when the same listing shows up again after a re-render.

postings = []
found_last = False
handled_jobs = set()

while not found_last:
    time.sleep(0.5)  # Allow the site to load new data after the scroll
    li_items = container.locator("li")
    for i in range(li_items.count()):
        item = li_items.nth(i)
        # text_content() can return None, so fall back to an empty string.
        title = (item.locator(".jobteaser-name-header").first.text_content() or "").strip()
        # Sentinel block that appears at the end of the list
        if title == "Haven't found your dream Data job yet?":
            found_last = True
            break
        job_id = ...  # Usually the job URL or another stable identifier
        if job_id in handled_jobs:
            continue  # Already collected before a re-render
        # Gather any fields you care about.
        postings.append(...)
        handled_jobs.add(job_id)
    # Scroll the inner list element, not the window, to trigger the next batch.
    page.eval_on_selector(
        "div.joblist-container div div",
        "el => { el.scrollTop += 288 }",
    )

The sentinel title (Haven’t found your dream Data job yet?) gives a deterministic exit and avoids guesswork like “scroll exactly N times and hope for the best.” One extra guardrail worth adding is a hard cap on scroll iterations. If the markup changes and the sentinel disappears, the function still exits cleanly instead of looping until the Lambda times out.
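
To make the sentinel-plus-cap idea concrete, here is a minimal sketch that leaves the browser out entirely. The names collect_until_sentinel, read_cards, scroll and the default max_scrolls=200 are all assumptions for illustration, not part of the original handler. In a real run, read_cards would wrap the container.locator("li") loop (including the short pause) and scroll would wrap the page.eval_on_selector call.

```python
from typing import Callable, Iterable


def collect_until_sentinel(
    read_cards: Callable[[], Iterable[tuple[str, str]]],
    scroll: Callable[[], None],
    sentinel_title: str,
    max_scrolls: int = 200,  # Arbitrary cap chosen for this sketch
) -> tuple[list[tuple[str, str]], bool]:
    """Collect unique (job_id, title) cards until the sentinel or the cap.

    Returns the collected cards and whether the sentinel was actually seen.
    """
    postings: list[tuple[str, str]] = []
    handled_jobs: set[str] = set()
    for _ in range(max_scrolls):
        for job_id, title in read_cards():
            if title == sentinel_title:
                return postings, True
            if job_id in handled_jobs:
                continue  # Same card re-rendered after a scroll
            handled_jobs.add(job_id)
            postings.append((job_id, title))
        scroll()
    # Cap reached: the sentinel never showed up, so report a partial result.
    return postings, False
```

Returning the flag alongside the data lets the caller decide what a capped run means, for example logging a warning and still uploading the partial snapshot.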

Writing parquet and uploading to S3

After scraping, the handler converts the payload into a Polars dataframe, casts every column to a string, writes a Parquet file to /tmp, and uploads it to a partitioned S3 key. This makes downstream ingestion easier, since each Lambda run produces a compact file in a predictable location.

import os
import tempfile
from typing import Any
from urllib.parse import urlparse

import boto3
import pendulum
import polars as pl
from aws_lambda_typing.context import Context
from aws_lambda_typing.events import EventBridgeEvent

_BUCKET_URL = os.environ["BUCKET_URL"]


def lambda_handler(event: EventBridgeEvent, context: Context) -> dict[str, Any]:
    # _parse_site is the scraping helper built from the snippets above.
    site_data = _parse_site(url="https://devitjobs.uk/jobs/Data/all")
    df = pl.DataFrame(site_data)
    df = df.select(pl.all().cast(pl.String))  # Casting to string for simplicity

    date = pendulum.now()
    file_name = f"{date.timestamp()}.parquet"
    file_path = f"{tempfile.gettempdir()}/{file_name}"
    df.write_parquet(file_path)

    bucket_name = urlparse(_BUCKET_URL).netloc
    client = boto3.client(service_name="s3")

    # Each directory creates a partition when you use Glue Crawler.
    key = f"dev_it_jobs/postings/year={date.year}/month={date.month}/{file_name}"
    client.upload_file(Filename=file_path, Bucket=bucket_name, Key=key)

    return {"statusCode": 200, "message": "Dev IT Jobs handled correctly"}

Parquet keeps storage efficient and query-friendly, and the date partitioning keeps recurring snapshots tidy for Athena, Spark, or any ETL flow you throw at it.
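
The key layout is easy to check in isolation. Below is a minimal sketch that builds the local path and the S3 key for a given moment. The helper name snapshot_paths is hypothetical, and it differs from the handler in two deliberate ways: it uses the standard-library datetime instead of pendulum so it runs without extra packages, and it truncates the timestamp to whole seconds to keep the file name free of a fractional part.

```python
import os
import tempfile
from datetime import datetime, timezone


def snapshot_paths(now: datetime, prefix: str = "dev_it_jobs/postings") -> tuple[str, str]:
    """Return (local_path, s3_key) for one snapshot taken at `now`."""
    file_name = f"{int(now.timestamp())}.parquet"
    local_path = os.path.join(tempfile.gettempdir(), file_name)
    # Hive-style year=/month= segments, mirroring the handler's key layout.
    s3_key = f"{prefix}/year={now.year}/month={now.month}/{file_name}"
    return local_path, s3_key


local_path, s3_key = snapshot_paths(datetime(2024, 5, 17, 12, 0, tzinfo=timezone.utc))
print(s3_key)
```

Keeping this logic in a pure function means the partition scheme can be unit-tested without touching S3 or launching a browser.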

Practical notes for dynamic pages in Lambda

If I had to compress this whole post into one sentence, it would be this: dynamic scraping in Lambda is mostly about controlling browser behavior, not parsing HTML faster. Once the browser is stable, the rest becomes a clean data-engineering loop.
