Definition:
Before diving into Python web scraping, let's first look at what web scraping is. In simple terms, web scraping is the automated extraction of data from a website. The automation processes the HTML of a web page so that the page can be converted into a different format, copied into a database, and so on. Web scraping accesses the World Wide Web over the HyperText Transfer Protocol, and it can also be done through a web browser, or even manually. In this Python web scraping tutorial, we use Selenium with Python. Note that web scraping is sometimes done for illegitimate purposes too: for a business whose model relies on its content, scraping can cause a significant financial loss.
Techniques of Web Scraping:
There are many techniques for web scraping. Some of them are listed below–
- Human Copy-Paste: This is the simplest method of web scraping. It involves manually copying and pasting data from a web page into a spreadsheet. Sometimes it is even the best method, because some websites have very strong protection against automation.
- Text Pattern Matching: This simple but powerful method is based on matching text patterns, for example with the UNIX grep command or with regular expressions (regex) in programming languages like Perl and Python.
- HTTP Programming: Using socket or HTTP programming, we can send HTTP requests to a remote web server to retrieve static or dynamic web pages.
- HTML Parsing: Many websites generate pages dynamically from an underlying source such as a database, so the pages share a common template. A program that detects such templates and extracts data from them is called a wrapper. This technique is mainly used in data mining.
- DOM Parsing: By embedding a full-fledged web browser, such as Internet Explorer or Mozilla Firefox, programs can extract content generated by client-side scripts. XPath can be used in this method.
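The text-pattern-matching technique above can be sketched with Python's built-in re module. The HTML fragment and the pattern here are made-up examples for illustration, not tied to any particular site:

```python
import re

# A made-up HTML fragment standing in for a downloaded page
html = """
<ul>
  <li><a href="/post/1">First Post</a></li>
  <li><a href="/post/2">Second Post</a></li>
</ul>
"""

# Pull out every link target and its text with a regular expression.
# Quick and effective for simple pages, though a real HTML parser is
# more robust once the markup gets messy.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)
print(links)  # [('/post/1', 'First Post'), ('/post/2', 'Second Post')]
```

For anything beyond one-off extractions, the HTML-parsing and DOM-parsing techniques below scale better than regex.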
Some Uses of Web Scraping:
There are many uses of web scraping. For instance, some of them are listed below–
- Web Data Integrating
- Data Mining
- Detecting change in a website
- Online presence checking
- Weather Data Monitoring
There are 6 types of testing that can be done with Selenium:
- Acceptance testing
- Functional testing
- Performance testing
- Regression testing
- Test-driven development (TDD)
- Behavior-driven development (BDD)
Languages that support Selenium:
- Java
- Python
- C#
- Ruby
- JavaScript
- Kotlin
Introduction to Selenium
Selenium is a tool that helps us automate web browsers. It is mostly used for testing web applications, but it can do much more: it opens a web browser and does the things a normal human would do, such as clicking buttons, searching for information, and filling in input fields. Handy as it is, keep in mind that scraping a website frequently or for malicious purposes is against most websites' terms of service, and your IP address may be banned for it.
A. Installation
Requirements:
Python 2 should be 2.7.9 or higher, OR Python 3 should be 3.4.0 or higher. Pip should also be installed.
Then run the following command in a terminal.
pip install selenium
Drivers:
Selenium requires a WebDriver to interface with each browser, and different browsers have different drivers. To illustrate, some of the most popular drivers are given below.
Browser | Driver |
Chrome | ChromeDriver |
Edge | Edge Driver |
Firefox | GeckoDriver |
Opera | Opera Driver |
Safari | SafariDriver (built-in) |
B. Getting Started
Simple Use :
Once you have installed the WebDriver and the pip package, you can start using Selenium with Python as below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Driver specification and executable path
driver = webdriver.Chrome(executable_path="PATH TO WEBDRIVER")
# Redirects to the specified website
driver.get("https://violet-cat-415996.hostingersite.com")
# Specifies the input element in the website
search_element = driver.find_element_by_id("is-search-input-0")
text = "Turtle"
# Sends the keys to the input field
search_element.send_keys(text)
The above code will open the site and search for “Turtle” in the search box. You can change the search term by replacing the value of the “text” variable.
Writing tests with selenium:
As we know, Selenium is mostly used for writing test cases. However, Selenium itself does not come with a testing framework, so we pair it with Python modules like pytest or nose.
Here, we will be using the built-in unittest module. The following code checks the functionality of the search box on python.org.
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        assert "No results found." not in driver.page_source

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
Now, if you want to run the above code from a shell, save it as test_python_org_search.py and run:
python test_python_org_search.py
.
----------------------------------------------------------------------
Ran 1 test in 15.566s
OK
We can clearly see that the test was successful. In addition, you can also run this in Jupyter or IPython.
Selenium Remote WebDriver for Python Selenium:
If you want to use the remote web driver, then you should have the selenium server running. After the server is running, you can use some of the following examples.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.CHROME)

driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.OPERA)

driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.HTMLUNITWITHJS)
The desired capabilities value is just a dictionary, so instead of using the constants above, you can specify the values yourself:
driver = webdriver.Remote(
    command_executor='http://127.0.0.1:4444/wd/hub',
    desired_capabilities={'browserName': 'htmlunit',
                          'version': '2',
                          'javascriptEnabled': True})
C. Navigation
Interaction with HTML:
<input type="text" name="inp" id="fill" class="sam">
In the above case, if you want to locate this HTML element, you can use any of the following.
scr = driver.find_element_by_name("inp")
scr = driver.find_element_by_id("fill")
scr = driver.find_element_by_class_name("sam")
scr = driver.find_element_by_xpath("//input[@id='fill']")
scr = driver.find_element_by_css_selector("input#fill")
NOTE: The ID of an element should be unique. This means that no other element in the webpage should have the same ID. If there is a duplicate ID, selenium will recognize the first one.
Sending Text to input field:
Furthermore, to send text to the input field, you need one more import:
from selenium.webdriver.common.keys import Keys
Now, you can use the above import in the following manner.
scr.send_keys("Sample Text")
Similarly, the table below lists the key codes for the special keys on the keyboard.
Key Codes:
ADD | ALT | ARROW_DOWN |
ARROW_LEFT | ARROW_RIGHT | ARROW_UP |
BACKSPACE | BACK_SPACE | CANCEL |
CLEAR | COMMAND | CONTROL |
DECIMAL | DELETE | DIVIDE |
DOWN | END | ENTER |
EQUALS | ESCAPE | F1 |
F10 | F11 | F12 |
F2 | F3 | F4 |
F5 | F6 | F7 |
F8 | F9 | HELP |
HOME | INSERT | LEFT |
LEFT_ALT | LEFT_CONTROL | LEFT_SHIFT |
META | MULTIPLY | NULL |
NUMPAD0 | NUMPAD1 | NUMPAD2 |
NUMPAD3 | NUMPAD4 | NUMPAD5 |
NUMPAD6 | NUMPAD7 | NUMPAD8 |
NUMPAD9 | PAGE_DOWN | PAGE_UP |
PAUSE | RETURN | RIGHT |
SEMICOLON | SEPARATOR | SHIFT |
SPACE | SUBTRACT | TAB |
Drag and Drop:
You can use this feature to move an element by an offset or to drop it onto another element. To use drag and drop, you can use the following Python Selenium code.
from selenium.webdriver import ActionChains
element = driver.find_element_by_name("source")
target = driver.find_element_by_name("target")
action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()
Moving between windows and frames:
Nowadays, most websites open content in multiple windows and frames, so Selenium provides methods to switch between them. For windows, the method is “switch_to_window” (spelled driver.switch_to.window(...) in newer Selenium versions). For example, let the HTML code be like
<a href="somewhere.html" target="windowName">Click here to open a new window</a>
Then, the python code will be
driver.switch_to_window("windowName")
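When the target window's name is not known, you can loop over the driver's open window handles instead. This is a sketch of the pattern, assuming a live driver with a second window already open:

```python
def switch_to_new_window(driver, original_handle):
    """Switch to the first window handle that differs from the original.

    Returns the handle switched to (or the original if no other exists).
    """
    for handle in driver.window_handles:
        if handle != original_handle:
            driver.switch_to.window(handle)
            return handle
    return original_handle

# Typical use with a live driver:
#   original = driver.current_window_handle
#   ... click something that opens a new window ...
#   switch_to_new_window(driver, original)
```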
Popups:
Selenium has built-in support for handling popup boxes. Switching to an open alert looks like this:
alert = driver.switch_to.alert
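Building on that one-liner, a small helper can read the alert's message and then accept or dismiss it. This is a sketch of the standard alert API, written as a reusable function; it assumes a live driver with an alert open:

```python
def handle_alert(driver, accept=True):
    """Read the open alert's message, then accept or dismiss it."""
    alert = driver.switch_to.alert
    message = alert.text
    if accept:
        alert.accept()   # clicks OK
    else:
        alert.dismiss()  # clicks Cancel
    return message
```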
Cookies:
Additionally, Selenium can also work with cookies. You can understand cookies in Selenium from the code below.
# Navigating to the domain
driver.get("http://www.example.com")
# Setting the cookie
cookie = {'name': 'aakriti', 'value': 'girl'}
driver.add_cookie(cookie)
# This will output all the available cookies
driver.get_cookies()
Fun Fact: We know that Selenium with Python is a popular way of testing web apps, but Selenium supports many other languages too. Some of them are Java, C#, Ruby, and JavaScript.
D. Locating
There are many ways to locate elements on a web page with Selenium; you can choose whichever fits your situation. The methods to locate a single element are:
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector
If you want to find several elements on a page at once, you can use the following methods.
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
Note: The above methods will return a list.
These are used in the example below.
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//input[text()="Some text"]')
driver.find_elements(By.XPATH, '//input')
Moreover, there are some attributes available for By class. They are listed below.
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
1. Locating an element by ID
We should use this when we know the id of an element. If the id provided is wrong, there will be an error called NoSuchElementException. Additionally, if we use this strategy, then selenium will recognize the first element with the ID.
To illustrate let the source of the page be like:
<html>
<body>
<form id="loginForm">
<input name="username" type="text">
<input name="password" type="password">
<input name="continue" type="submit" value="Login">
</form>
</body>
</html>
Here, the form element can be located like:
form = driver.find_element_by_id('loginForm')
2. Locating an element by Name
We should use this when we know the name of an element. If the name provided is wrong, there will be an error called NoSuchElementException. Additionally, if we use this strategy, then selenium will recognize the first element with the name. This is the same as above.
For instance, let the source of a page be:
<html>
<body>
<form id="loginForm">
<input name="username" type="text">
<input name="password" type="password">
<input name="continue" type="submit" value="Login">
<input name="continue" type="button" value="Clear">
</form>
</body>
</html>
Therefore, the Python code to locate the username and password fields will be as follows.
username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')
Moreover, to locate the login button, we use the code
continue_button = driver.find_element_by_name('continue')
Note: continue is a reserved keyword in Python, so it cannot be used as a variable name. Also, Selenium will locate the Login button because it occurs before Clear.
3. Locating with XPath
XPath is a language that is used to locate nodes in XML documents. XPath supports the simple locating methods by id or name attributes and extends them by opening up all sorts of new possibilities, such as locating the third checkbox on the page.
One of the main reasons for using XPath is when you don’t have a suitable id or name attribute for the element you wish to locate.
You can use XPath to either locate the element in absolute terms or relative to an element that does have an id or name attribute.
XPath locators can also be used to specify elements via attributes other than id and name.
By finding a nearby element with an id or name attribute, you can locate your target element based on the relationship.
Let the source of a page be:
<html>
<body>
<form id="loginForm">
<input name="username" type="text" />
<input name="password" type="password" />
<input name="continue" type="submit" value="Login" />
<input name="continue" type="button" value="Clear" />
</form>
</body>
</html>
So, the code to locate the form element will be
login_form = driver.find_element_by_xpath("/html/body/form[1]")
login_form = driver.find_element_by_xpath("//form[1]")
login_form = driver.find_element_by_xpath("//form[@id='loginForm']")
Consequently, the codes to locate username element could be:
username = driver.find_element_by_xpath("//form[input/@name='username']")
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
username = driver.find_element_by_xpath("//input[@name='username']")
Likewise, the Clear element can be located as follows.
clear_button = driver.find_element_by_xpath("//input[@name='continue'][@type='button']")
clear_button = driver.find_element_by_xpath("//form[@id='loginForm']/input[4]")
4. Locating Hyperlinks by Link Text
You should use this strategy when you want to locate an anchor (link) element by its visible text.
When we use this method, we get the first anchor element whose text matches. If there is no such element, there will be an error called NoSuchElementException.
For instance, let the source of the page be like this.
<html>
<body>
<p>Are you sure you want to do this?</p>
<a href="next.html">Next</a>
<a href="previous.html">Back</a>
</body>
</html>
Subsequently, the next.html and previous.html links can be located as follows.
continue_link = driver.find_element_by_link_text('Next')
continue_link = driver.find_element_by_partial_link_text('Back')
5. Locating Elements by Tag Name
You should use this strategy when you want to locate an element by its tag name. When we use this method, we get the first element with the matching tag name. If there is no such element, there will be an error called NoSuchElementException.
For example, let the source be:
<html>
<body>
<p class="cont">lorem ipsum.</p>
</body>
</html>
Here, the “p” element can be located as
loc = driver.find_element_by_tag_name('p')
6. Locating Elements by CSS Selector
You can use this method when you want to locate with CSS selector syntax. Likewise, if there is no matching element, there will be an error. The error, as always, is NoSuchElementException.
To illustrate, let the html code be:
<html>
<body>
<p class="cont">Lorem ipsum.</p>
</body>
</html>
Then, the code to locate the “p” tag will be:
content = driver.find_element_by_css_selector('p.cont')
E. Wait
Nowadays, many websites use the AJAX technique, where the elements of a page load at different times. This often creates a problem: if a locate function runs before the element has loaded, an error called ElementNotVisibleException occurs. To solve this issue, Selenium has a feature called waits, which lets us add an interval between locating elements and performing actions on them.
The Selenium WebDriver provides two types of waits: implicit waits and explicit waits. They are explained below.
1. Explicit Waits
With this wait, we make the program wait until a specified condition is fulfilled before proceeding. The crudest form of this is time.sleep() from Python's built-in “time” module, which waits for an exact period regardless of conditions. A better way is to use WebDriverWait together with ExpectedCondition.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
If we run the above Python code, it will wait up to a maximum of 10 seconds for a matching element to be found. If no element is found within that time, a TimeoutException occurs; on success, until() returns the located element.
A. Expected Conditions
While automating the web, some conditions come up again and again, so Selenium ships them as prebuilt expected conditions. Some of them are listed below.
- title_is.
- title_contains.
- presence_of_element_located.
- visibility_of_element_located.
- visibility_of.
- presence_of_all_elements_located.
- text_to_be_present_in_element.
- text_to_be_present_in_element_value.
- frame_to_be_available_and_switch_to_it.
- invisibility_of_element_located.
- element_to_be_clickable.
- staleness_of.
- element_to_be_selected.
- element_located_to_be_selected.
- element_selection_state_to_be.
- element_located_selection_state_to_be.
- alert_is_present.
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, 'someid')))
B. Custom Wait Conditions
When none of the built-in conditions fits, you can write your own: a custom wait condition is a class with a __call__ method that returns a falsy value while the condition does not hold.
class element_has_css_class(object):
    """An expectation for checking that an element has a particular css class.

    locator - used to find the element
    returns the WebElement once it has the particular css class
    """
    def __init__(self, locator, css_class):
        self.locator = locator
        self.css_class = css_class

    def __call__(self, driver):
        # Finding the referenced element
        element = driver.find_element(*self.locator)
        if self.css_class in element.get_attribute("class"):
            return element
        else:
            return False

# Wait until an element with id='myNewInput' has class 'myCSSClass'
wait = WebDriverWait(driver, 10)
element = wait.until(element_has_css_class((By.ID, 'myNewInput'), "myCSSClass"))
2. Implicit Waits
Implicit waits are used less often in Selenium. An implicit wait tells the web driver to poll the DOM for a certain amount of time whenever a specified element is not immediately available.
from selenium import webdriver
driver = webdriver.Firefox()
driver.implicitly_wait(10) # seconds
driver.get("http://somedomain/url_that_delays_loading")
myDynamicElement = driver.find_element_by_id("myDynamicElement")
F. Page Objects
Finally, we will learn about the page object design pattern. In this pattern, each area of a web page that a test interacts with is represented as a class, called a page object.
Some benefits of using page objects patterns are:
- Code can be reused across many test cases.
- Duplicated code is reduced.
- If the user interface changes, the code only needs to change in one place.
1. Test Case
Here is a test case that searches the site and checks that results are returned.
import unittest
from selenium import webdriver
import page

class CopyAssignmentsSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.get("https://copyassignment.com/")

    def test_search_in_python_org(self):
        """
        Tests the violet-cat-415996.hostingersite.com search feature. Searches for the word
        "pycon", then verifies that some results show up. Note that it does not look for
        any particular text on the search results page; it only verifies that the results
        were not empty.
        """
        # Load the main page. In this case, the home page of violet-cat-415996.hostingersite.com.
        main_page = page.MainPage(self.driver)
        # Checks that the expected word is in the title
        assert main_page.is_title_matches(), "violet-cat-415996.hostingersite.com title doesn't match."
        # Sets the text of the search textbox to "pycon"
        main_page.search_text_element = "pycon"
        main_page.click_go_button()
        search_results_page = page.SearchResultsPage(self.driver)
        # Verifies that the results page is not empty
        assert search_results_page.is_results_found(), "No results found."

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
2. Page Object Classes
The page object pattern helps create a separation between the test code and the technical implementation.
from element import BasePageElement
from locators import MainPageLocators

class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""
    # The locator for the search box where the search string is entered
    locator = 'q'

class BasePage(object):
    """Base class to initialize the base page that will be called from all pages"""
    def __init__(self, driver):
        self.driver = driver

class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""
    # Declares a variable that will contain the retrieved text
    search_text_element = SearchTextElement()

    def is_title_matches(self):
        """Verifies that the hardcoded text "Python" appears in the page title"""
        return "Python" in self.driver.title

    def click_go_button(self):
        """Triggers the search"""
        element = self.driver.find_element(*MainPageLocators.GO_BUTTON)
        element.click()

class SearchResultsPage(BasePage):
    """Search results page action methods come here"""
    def is_results_found(self):
        # Probably should search for this text in a specific page
        # element, but for now this works fine
        return "No results found." not in self.driver.page_source
3. Page Elements
from selenium.webdriver.support.ui import WebDriverWait

class BasePageElement(object):
    """Base page class that is initialized on every page object class."""

    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element_by_name(self.locator))
        driver.find_element_by_name(self.locator).clear()
        driver.find_element_by_name(self.locator).send_keys(value)

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver
        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element_by_name(self.locator))
        element = driver.find_element_by_name(self.locator)
        return element.get_attribute("value")
4. Locators
In this pattern, the locator strings of a page are kept separate from the code where they are used; here they live in their own classes.
from selenium.webdriver.common.by import By

class MainPageLocators(object):
    """A class for main page locators. All main page locators should come here"""
    GO_BUTTON = (By.ID, 'submit')

class SearchResultsPageLocators(object):
    """A class for search results locators. All search results locators should come here"""
    pass
Other Tools Like Selenium:
- Katalon Studio
- Subject7
- Screenster
- TestCraft
- Endtest
- Browsersync
- Protractor
- CasperJS
Some Other Python Web Scraping Tools:
- Beautiful Soup
- LXML
- MechanicalSoup
- Python Requests
- Scrapy
- urllib
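As a small taste of the non-Selenium route, Python's standard library alone can parse HTML. This sketch uses html.parser (no browser, no network) on a made-up fragment:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<p><a href="/one">One</a> and <a href="/two">Two</a></p>')
print(parser.links)  # ['/one', '/two']
```

Libraries like Beautiful Soup and lxml build far more convenient query interfaces on top of this kind of parsing.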
This completes the Python web scraping tutorial, and I hope you have learned plenty from it; I have tried to cover all the basics that I know. An article on design is still left and will be done soon.
Comment if you have any queries or found something wrong in the article or on our website.
Thanks for Reading
Keep Learning