Common crawler operations using Selenium in Java

1. Add the relevant Maven dependencies

My Spring Boot version here is 2.2.0.RELEASE (the examples happen to be written in a Spring Boot project, but running them from a plain main method works just the same; Spring Boot is only used for convenience when saving data to a database later).
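A minimal dependency block might look like the following (the artifact coordinates are the standard Selenium Java bindings; the version shown is only an example, pick one compatible with your browser driver):

```xml
<!-- Selenium Java bindings; the version below is an example, use one that matches your driver -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
```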


2. Download the relevant browser driver (Chrome and Edge are covered here)

ChromeDriver download address:
Edge driver download address:

Important: before downloading a driver, check your browser's version number and download the matching driver version; otherwise the program will not be able to launch the browser.

3. Launch the browser (the following code can be run directly in a main method)

//Set the location of the driver you just downloaded
System.setProperty("webdriver.chrome.driver", "C:\\Users\\Administrator\\Desktop\\chromedriver.exe");
//Create a driver object
//You can also create an Edge driver object the same way; the routine is identical
//EdgeDriver edgeDriver = new EdgeDriver();
ChromeDriver chromeDriver = new ChromeDriver();
//The ChromeDriver object exposes a lot of information; try it yourself
//Have the browser visit an address
chromeDriver.get("https://www.example.com");//any URL works here

4. Common operations (I will keep updating these as I run into more)

4.1 Obtaining web page elements

//chromeDriver.findElement() queries a single element; when several match, only the first is returned
//chromeDriver.findElements() returns a list of matching elements
//Both of these methods search from the root node of the page
//The returned WebElement can itself call findElement() or findElements(), which searches starting from that element
//The By parameter supports By.id(), By.tagName(), By.xpath(), etc., to locate elements in different ways
WebElement body = chromeDriver.findElement(By.tagName("body"));//Find the first body tag on the html page
List<WebElement> divs = body.findElements(By.tagName("div"));//Find all div elements under the body tag
//By combining these two methods you can generally reach any element on the page

4.2 Executing JavaScript on the page and getting the return value

//Execute document.body.scrollHeight and get the return value (the height of the body tag)
Long totalHeight = (Long) chromeDriver.executeScript("return document.body.scrollHeight");
//Some operations are cumbersome in Java but much simpler in js, and you can execute js code directly
//Scroll the page down
chromeDriver.executeScript("window.scrollTo(0,100)");//The cast and return are unnecessary when you don't need a value back

4.3 Moving the mouse to an element and clicking (needed for some simple simulated logins)

//First locate the login button element
WebElement loginButton = chromeDriver.findElement(By.id("login_id"));
//Move the mouse over the element and click it
Actions actions = new Actions(chromeDriver);//The Actions object offers many mouse and keyboard actions; try them yourself
actions.moveToElement(loginButton).click().perform();

4.4 Scrolling the page (a good way to handle scroll-triggered loading)

//This is again just executing js code to scroll the page
//Get the height of the body
//Then scroll to the bottom
Long totalHeight = (Long) chromeDriver.executeScript("return document.body.scrollHeight");
chromeDriver.executeScript("window.scrollTo(0," + totalHeight + ")");

5. Common pitfalls

5.1 Element lookup fails when the page has not finished loading

If an element has not finished loading, its node does not yet exist in the html, and trying to fetch it throws an error (this feels strange, but there is no way around it; that is how it was designed).
Solutions:

5.1.1 After chromeDriver.get() executes, put the thread to sleep for a while to wait for the page to load. This seems unwise, but it is indeed a workable method.

Thread.sleep(2000);//The exact sleep time depends on your network speed and how long the page takes to load; just estimate it

5.1.2 Set a global implicit wait: every element lookup waits up to 5 seconds (adjust to your needs), and as soon as the element appears execution continues. You only need to set it once, and it lasts until the driver instance is closed.

With this set, the program waits until the page is loaded (the little spinner in the browser stops) before executing the following code. If most of the page has already loaded and only a few elements are still pending, this can waste time.

chromeDriver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);

5.1.3 Wait for a specific element to appear before performing the operation

//Set the timeout to 10 seconds
WebDriverWait wait = new WebDriverWait(chromeDriver, 10);
//Wait until the body element appears before performing the following operations
wait.until(ExpectedConditions.presenceOfElementLocated(By.tagName("body")));

5.2 Some elements sit inside an iframe tag and cannot be reached directly

Use code to switch the chromeDriver into the iframe.
Some pages have multiple nested iframes; you have to switch in layer by layer, but switching back out takes only one call.
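The layer-by-layer switching can be sketched as follows (the frame names and the element id here are hypothetical, for illustration only):

```java
//Sketch: switching into nested iframes layer by layer (locators are hypothetical)
chromeDriver.switchTo().frame("outerFrame");   //first switch into the outer iframe
chromeDriver.switchTo().frame("innerFrame");   //then into the iframe nested inside it
WebElement target = chromeDriver.findElement(By.id("target"));//elements inside the inner iframe are now reachable
chromeDriver.switchTo().defaultContent();      //a single call returns to the top-level document
```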

5.2.1 If the iframe tag has a name or id

//Switch directly with the switchTo().frame() method; the parameter is the iframe's name or id
chromeDriver.switchTo().frame("iframeNameOrId");

5.2.2 Switching into the iframe as an element

//Find the iframe tag
WebElement iframe = chromeDriver.findElement(By.tagName("iframe"));
//Switch into the iframe
chromeDriver.switchTo().frame(iframe);

5.2.3 Switching back out of the iframe


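This is the standard Selenium call for returning to the top-level document; one call is enough no matter how deeply nested the iframes are:

```java
//Switch back out to the top-level document in a single call
chromeDriver.switchTo().defaultContent();
```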
5.3 Working around anti-crawling measures that detect Selenium

5.3.1 Disable the Blink runtime feature that flags automation

Override the value of window.navigator.webdriver so the page no longer sees the browser as automated (the script below makes it return undefined):

chromeDriver.executeScript("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})");
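The heading refers to Chrome's Blink automation flag, which can also be disabled at startup through ChromeOptions, a common companion to the script above (a sketch; this replaces the plain ChromeDriver construction from section 3):

```java
//Sketch: start Chrome with the automation-controlled Blink feature disabled
ChromeOptions options = new ChromeOptions();
options.addArguments("--disable-blink-features=AutomationControlled");
ChromeDriver chromeDriver = new ChromeDriver(options);
```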

Tags: Java crawler Selenium

Posted by tail on Sun, 17 Jul 2022 10:05:59 +0930