# web-scraping

Here are 2,626 public repositories matching this topic...

lorien / awesome-web-scraping

List of libraries, tools and APIs for web scraping and data processing.

Updated Nov 30, 2021
Makefile

alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

python crawler machine-learning scraper automation ai scraping artificial-intelligence web-scraping scrape webscraping webautomation

Updated Feb 3, 2021
Python

apify / apify-js

Open

Update main examples to include DOM manipulation

1

mtrunkat commented Sep 17, 2019

The main examples on the Apify SDK webpage, GitHub repo, and CLI templates should demonstrate how to manipulate the DOM and retrieve data from it.

Also add one example of scraping with Apify SDK + jQuery to https://sdk.apify.com/docs/examples/basiccrawler

Feedback from: https://medium.com/better-programming/do-i-need-python-scrapy-to-build-a-web-scraper-7cc7cac2081d

I lost an hour trying to make

good first issue

Open

Improve error messages

1

Open

Handle ENOMEM gracefully in memory snapshotter in AutoscaledPool

1

php-curl-class / php-curl-class

PHP Curl Class makes it easy to send HTTP requests and integrate with web APIs

Updated Nov 27, 2021
PHP

mherrmann / selenium-python-helium

Selenium-python but lighter: Helium is the best Python library for web automation.

python firefox chrome webdriver selenium python3 web-scraping helium web-automation selenium-python

Updated Sep 3, 2021
Python

lorien / grab

Web Scraping Framework

python framework spider asynchronous network http-client web-scraping pycurl urllib3

Updated Feb 22, 2021
Python

go-rod / rod

A Devtools driver for web automation and scraping

testing go golang scraper automation web chrome-devtools headless devtools crawling web-scraping cdp chrome-headless rod chrome-devtools-protocol devtools-protocol gorod

Updated Nov 26, 2021
Go

codingforentrepreneurs / 30-Days-of-Python

Learn Python for the next 30 (or so) Days.

python api flask automation tutorial csv jupyter rest-api selenium pandas python3 web-scraping selenium-webdriver fastapi

Updated Nov 8, 2021
HTML

justmarkham / DAT8

General Assembly's 2015 Data Science course in Washington, DC

python data-science machine-learning natural-language-processing course clustering naive-bayes linear-regression scikit-learn jupyter-notebook pandas data-visualization web-scraping data-analysis ensemble-learning logistic-regression decision-trees regular-expressions data-cleaning model-evaluation

Updated Apr 18, 2016
Jupyter Notebook

tidyverse / rvest

Simple web scraping for R

html r web-scraping

Updated Oct 28, 2021
R

snooppr / snoop

Snoop: a reconnaissance tool based on open-source data (OSINT world)

Updated Dec 3, 2021
Python

x4nth055 / pythoncode-tutorials

The Python Code Tutorials

python python-tutorials machine-learning natural-language-processing computer-vision text-classification tutorials python3 web-scraping face-detection scapy network-analysis network-programming programming-tutorial ethical-hacking network-security socket-programming scapy-tutorials

Updated Nov 25, 2021
Jupyter Notebook

vprusso / youtube_tutorials

Collection of scripts corresponding to LucidProgramming YouTube tutorials

python python3 web-scraping youtube-tutorial python-tutorial ctci-solutions lucidprogramming python3-tutorial technical-interview

Updated Feb 10, 2021
Python

DataHenHQ / till

DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.

crawler scraper scraping mitm proxy-server web-scraping man-in-the-middle

Updated Dec 4, 2021
Go

juancarlospaco / faster-than-requests

Faster requests on Python 3

Updated Sep 13, 2021
Nim

postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

ruby crawler scraper web spider web-crawler web-scraper web-scraping web-spider spider-links

Updated Jun 23, 2021
Ruby

dinubs / coolqlcool

Nextjs server to query websites with GraphQL

javascript graphql schema nextjs web-scraping

Updated Aug 13, 2021
JavaScript

intoli / user-agents

A JavaScript library for generating random user agents with data that's updated daily.

javascript user-agent random randomization navigator web-scraping browsers browser-automation user-agent-spoofer

Updated Dec 4, 2021
JavaScript

alecxe / scrapy-fake-useragent

Random User-Agent middleware based on fake-useragent

python web-scraping scrapy

Updated Sep 17, 2020
Python

A9T9 / RPA

UI.Vision: Open-Source RPA Software (formerly Kantu) - Modern Robotic Process Automation with Selenium IDE++

opencv automation webassembly web-scraping autohotkey browser-extension imacros selenium-ide browser-automation visual-recognition sikulix web-automation ui-tests uipath data-driven-tests

Updated Sep 4, 2021
JavaScript

rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

css python parser html5 web-scraping modest-engine

Updated Nov 17, 2021
Cython

AlexMathew / scrapple

A framework for creating semi-automatic web content extractors

python crawler tutorial extractor scraping web-scraper selector css-selector web-scraping scrapy scrapers beautifulsoup xpath-expression lxml selector-expression

Updated Oct 24, 2020
Python

VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.

web-crawler web-scraping hacktoberfest web-spider focused-crawler domain-specific-search web-search

Updated Nov 22, 2021
Java

adbar / trafilatura

Open

Test trafilatura on further web pages and report bugs

1

adbar commented Jan 9, 2020

I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in [xpaths.py](https://github.com

good first issue up for grabs

austinoboyle / scrape-linkedin-selenium

Open

Certifications return empty []

2

anntdiv commented Jun 10, 2021

Hello,
Thanks for the new update to the personal_info section.
I found that the 'certifications' attribute returns an empty list [].
Test url: https://www.linkedin.com/in/an-nguyen-9b3248122/
Results:
`{'personal_info': {'name': 'An Nguyen',
'headline': 'Data Scientist/Machine Learning Engineer',
'company': 'PERSOL PROCESS & TECHNOLOGY CO., LTD.',
'school': 'National Chiao Tung University',

help wanted good first issue

Open

Companyscraper doesn't work and returns error 'NoneType'

3

Open

Scrape linkedin posts

3

jaebradley / basketball_reference_web_scraper

NBA Stats API via Basketball Reference

python nba web-scraper web-scraping basketball-reference

Updated Dec 4, 2021
HTML

infinitbyte / gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

lightweight elasticsearch crawler spider web-crawler scraping crawling web-scraping web-spider

Updated May 19, 2021
Go

sangaline / wayback-machine-scraper

A command-line utility and Scrapy middleware for scraping time series data from Archive.org's Wayback Machine.

python web-scraping command-line-tool wayback-machine wayback-archiver archive-dot-org

Updated Feb 15, 2021
Python

yusuzech / r-web-scraping-cheat-sheet

Guide, reference and cheatsheet on web scraping using rvest, httr and Rselenium.

r web-scraping cheatsheet webscraping rvest scrape-websites httr rselenium

Updated Oct 25, 2021
R

je-suis-tm / web-scraping

Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

web-scraper web-scraping newsletter reuters bloomberg futures web-scrapers scrapper financial-data news-websites data-scraping news-scraper futures-historical-data data-scraper sraping python-web-scraper financial-times options-data wall-street-journal wallstreetbets

Updated Jun 28, 2021
Python
