XPath following

This article analyzes how the "follow" behavior in XPath is implemented, explains the differences between the following and following-sibling axes and when to use each, and presents node-location strategies and anti-crawling countermeasures drawn from engineering practice.


1. Analysis of core concepts

XPath implements "follow" behavior through axes, which select nodes in a given direction relative to the current node in the Document Object Model (DOM). Two axes matter most:

following axis

Definition: Selects all nodes that come after the current node in document order, at any depth, excluding the current node's own descendants as well as attribute and namespace nodes.

Syntax example: //div[@id='header']/following::p

Typical scenarios:

<div id="header">Title</div>

<p>Paragraph 1</p>

<section>

<p>Paragraph 2</p>

</section>

The above expression matches both Paragraph 1 and Paragraph 2, because both appear after <div id="header"> in document order, even though Paragraph 2 is nested inside <section>.
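
The axis semantics can be checked with a short Python sketch. The standard library's xml.etree.ElementTree supports only a subset of XPath and has no following axis, so the helper below (a hypothetical name, not a library function) emulates it by walking the document in order. The <body> wrapper is an assumption added so the fragment parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# The fragment from the example, wrapped in a hypothetical <body> root
# so that it parses as well-formed XML.
DOC = """
<body>
  <div id="header">Title</div>
  <p>Paragraph 1</p>
  <section>
    <p>Paragraph 2</p>
  </section>
</body>
"""


def following(root, context):
    """Emulate the following axis for elements: everything after
    `context` in document order, excluding its own descendants."""
    in_order = list(root.iter())                 # elements in document order
    descendants = set(context.iter()) - {context}
    start = in_order.index(context) + 1
    return [el for el in in_order[start:] if el not in descendants]


root = ET.fromstring(DOC)
header = root.find(".//div[@id='header']")
paragraphs = [el.text for el in following(root, header) if el.tag == "p"]
print(paragraphs)  # ['Paragraph 1', 'Paragraph 2']
```

Calling following(root, root.find("section")) returns an empty list, because the only later element is a descendant of <section>, and descendants are not part of the following axis.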

following-sibling axis

Definition: Selects only the nodes that come after the current node and share its parent; it never crosses levels.

Syntax example: //li[@class='target']/following-sibling::li

Typical scenarios:

<ul>

<li>Project A</li>

<li class="target">Project B</li>

<li>Project C</li>

<li>Project D</li>

</ul>

This expression selects exactly Project C and Project D. Project A is excluded because it precedes the target, and nodes at other levels are never considered.
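
The same result can be reproduced in plain Python. This is a minimal sketch that emulates following-sibling with the standard library, since xml.etree.ElementTree does not implement that axis; following_siblings is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DOC = """
<ul>
  <li>Project A</li>
  <li class="target">Project B</li>
  <li>Project C</li>
  <li>Project D</li>
</ul>
"""


def following_siblings(root, context):
    """Emulate following-sibling: later children of the same parent."""
    # ElementTree elements do not store their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(context)
    if parent is None:          # the root element has no siblings
        return []
    siblings = list(parent)
    return siblings[siblings.index(context) + 1:]


root = ET.fromstring(DOC)
target = root.find(".//li[@class='target']")
items = [li.text for li in following_siblings(root, target) if li.tag == "li"]
print(items)  # ['Project C', 'Project D']
```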


2. Functional comparison and selection strategy

Two related axes complement the ones above:

preceding axis: Selects all nodes before the current node in document order, excluding its ancestors. It is useful for searching backward from a known anchor element.

ancestor axis: Selects every ancestor of the current node, up to the root. It is suitable for locating the container that wraps the target element.

Selection suggestion:

If you need to find later content across levels (such as price fields scattered through a product details page), prefer the following axis.

When processing structured data such as tables and lists (such as fields in the same row of a financial report), the following-sibling axis is more efficient, because it inspects only the siblings of the current node instead of the rest of the document.

When IDs or class names are dynamically generated or obfuscated, anchor on a stable feature of a nearby element (such as a data-testid attribute) and use the following axis from that anchor to reach the target.


3. Engineering Practice and Anti-crawling Countermeasures

Scenario 1: E-commerce price monitoring

Requirement: Capture the price element that follows the product title (it may be nested inside several layers of <div>).

Solution:

//h2[contains(text(),'Product Name')]/following::span[@class='price'][1]

Technical points:

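The following axis crosses level boundaries, so the expression reaches the price tag no matter how deeply it is nested.

The [1] predicate keeps only the first matching price after each title in document order, which avoids collecting duplicate or unrelated prices further down the page.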
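
A minimal sketch of this scenario, assuming the third-party lxml package is installed. The page markup is hypothetical and exists only to exercise the expression. Note that in XPath 1.0, contains(text(), ...) tests only the first text node of the h2; contains(., ...) is a common alternative when the title contains inline markup.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical product page: the price is nested in <div>s after the title,
# and a second price appears later in the same card.
PAGE = """
<html><body>
  <div class="card">
    <h2>Product Name: Widget</h2>
    <div><div><span class="price">19.99</span></div></div>
    <span class="price">24.99</span>
  </div>
</body></html>
"""

PRICE_XPATH = (
    "//h2[contains(text(),'Product Name')]"
    "/following::span[@class='price'][1]"
)

tree = html.fromstring(PAGE)
prices = [span.text_content().strip() for span in tree.xpath(PRICE_XPATH)]
print(prices)  # ['19.99']
```

Without the [1] predicate the same expression would return both spans.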

Scenario 2: IP protection for high-frequency data collection

Problem: Platforms such as LinkedIn block IP addresses that send requests at high frequency.

Countermeasures:

Use a proxy IP pool (such as abcproxy's residential proxy) to rotate the request source IP.

Use precise expressions such as the following-sibling axis to extract every required field from a page that has already been fetched, so the same page does not have to be requested and parsed again.

Set a random request interval (2-10 seconds) to mimic the pace of a human user.
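
The rotation and pacing described above can be sketched with the standard library alone. The proxy URLs, function names, and the fetch callable are all hypothetical placeholders; in practice fetch might wrap an HTTP client call that routes the request through the given proxy.

```python
import itertools
import random
import time

# Hypothetical proxy endpoints; substitute the ones issued by your provider.
PROXIES = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
    "http://proxy-3.example.com:8000",
]


def request_schedule(urls, proxies, min_delay=2.0, max_delay=10.0, rng=random):
    """Yield (url, proxy, delay) tuples: proxies rotate round-robin and
    each delay is drawn uniformly from [min_delay, max_delay] seconds."""
    rotation = itertools.cycle(proxies)
    for url in urls:
        yield url, next(rotation), rng.uniform(min_delay, max_delay)


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url, proxy) for each URL, pausing after every request."""
    results = []
    for url, proxy, delay in request_schedule(urls, PROXIES):
        results.append(fetch(url, proxy))
        sleep(delay)
    return results
```

Passing sleep as a parameter keeps the pacing logic testable without real delays; the default still pauses for the drawn interval.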

Scenario 3: Dynamic rendering page adaptation

Challenge: Pages generated by frameworks such as React/Vue need to wait for JavaScript rendering to complete.

Solution design:

Use Selenium or Playwright to drive a headless browser and load the fully rendered DOM.

Add an explicit wait (such as Selenium's WebDriverWait) to ensure the target element has finished loading before it is queried.

Use the following axis to locate dynamically generated recommended-content blocks.


4. Performance Optimization and Common Pitfalls

Performance pitfalls:

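The following axis considers every node that comes after the current node in the rest of the document, which can make queries slow on large pages.

Optimization plan:

//div[@id='content-area']//div[contains(@class,'target')]

Narrow the search by anchoring it to an ancestor node (such as id='content-area') and searching only its descendants. Writing //div[@id='content-area']//following::div[contains(@class,'target')] does not confine the search, because the following axis also reaches nodes outside the container. If the following axis is required, add a predicate such as [ancestor::div[@id='content-area']] to the final step.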
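
How far a scoped expression really reaches is easy to check empirically. The sketch below, assuming the third-party lxml package and a hypothetical page, compares a //following:: expression with a descendant search and with a following-axis search confined by an ancestor predicate.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical page: one target inside the container, one in the footer.
PAGE = """
<html><body>
  <div id="content-area">
    <h2>Results</h2>
    <div class="target item">inside</div>
  </div>
  <div id="footer">
    <div class="target ad">outside</div>
  </div>
</body></html>
"""

tree = html.fromstring(PAGE)


def texts(xpath):
    """Evaluate an XPath expression and return the stripped text of each match."""
    return [el.text_content().strip() for el in tree.xpath(xpath)]


# Following axis started from nodes in the container: not confined to it.
leaky = texts(
    "//div[@id='content-area']//following::div[contains(@class,'target')]"
)

# Descendant search: confined to the container.
scoped = texts("//div[@id='content-area']//div[contains(@class,'target')]")

# Following axis kept, with an ancestor predicate confining the matches.
scoped_following = texts(
    "//div[@id='content-area']//h2/following::div"
    "[contains(@class,'target')][ancestor::div[@id='content-area']]"
)

print(leaky)             # ['inside', 'outside']
print(scoped)            # ['inside']
print(scoped_following)  # ['inside']
```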

Dynamic content invalidation:

If an XPath expression that works on most pages returns nothing on others, the DOM may not be ready yet because of asynchronous loading. Add a retry mechanism or adjust the waiting strategy.
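
A retry mechanism of this kind can be sketched as a small polling helper. wait_until and its parameters are hypothetical names; probe stands for any zero-argument callable that runs the XPath query and returns its (possibly empty) result list.

```python
import time


def wait_until(probe, timeout=10.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns a truthy value or
    `timeout` seconds elapse; return the value or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        result = probe()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        sleep(interval)
```

With a browser driver, probe would typically re-run the query against the live page on every attempt. Selenium's WebDriverWait offers a comparable polling loop out of the box, so a hand-written helper like this is mainly useful with parsers that have no built-in waiting.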


5. Technological Evolution and Expanded Applications

Smart positioning tools:

Modern browser developer tools can generate an XPath for a selected element automatically, but the generated paths often depend on a brittle chain of element positions. It is recommended to rewrite them by hand into robust, axis-based expressions.

Working with CSS selectors:

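Simple scenarios: Prefer CSS selectors (for example, div.target + ul selects the ul immediately after div.target, and div.target ~ ul corresponds to following-sibling::ul).

Complex scenarios: Switch to XPath axes for cross-level positioning that CSS combinators do not cover (such as following::ul).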

AI-assisted positioning:

Some testing tools (such as Testim.io) generate XPath automatically through visual recognition, but the resulting expressions are hard to read and should be verified and optimized manually.

