Here’s a detailed outline for the JSON-CSS Extraction Strategy video, covering all key aspects and supported structures in Crawl4AI:
10.1 JSON-CSS Extraction Strategy
1. Introduction to JSON-CSS Extraction
- JSON-CSS Extraction is used for pulling structured data from pages with repeated patterns, like product listings, article feeds, or directories.
- This strategy allows defining a schema with CSS selectors and data fields, making it easy to capture nested, list-based, or singular elements.
2. Basic Schema Structure
- Schema Fields: The schema has two main components:
  - `baseSelector`: A CSS selector to locate the main elements you want to extract (e.g., each article or product block).
  - `fields`: Defines the data fields for each element, supporting various data types and structures.
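As a rough sketch of that shape, a schema can be written as a plain Python dictionary with those two keys. The selectors and the field name here are placeholders chosen for illustration, not taken from a real page.

```python
# Hypothetical skeleton: ".item", ".title" and "title" are placeholder names.
schema = {
    # One match per repeated block on the page (a product card, an article, ...).
    "baseSelector": ".item",
    # What to pull out of each matched block.
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
    ],
}
```

Each entry in `fields` names one output key and says where to find its value inside the block matched by `baseSelector`.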
3. Simple Field Extraction
- Example HTML:
- Schema:
- Explanation: Each field captures the text content matched by its CSS selector within each `.product` element.
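A minimal sketch of what this section describes. The markup and schema are illustrative stand-ins, and the `by_class` and `extract` helpers are hypothetical: a toy matcher built on the standard library that only understands single-class selectors. It is meant to show how schema fields map to output records, not how Crawl4AI implements the strategy.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with two repeated .product blocks.
html = """
<div>
  <div class="product">
    <h2 class="title">Sample Product</h2>
    <span class="price">$19.99</span>
  </div>
  <div class="product">
    <h2 class="title">Another Product</h2>
    <span class="price">$5.00</span>
  </div>
</div>
"""

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

def by_class(root, selector):
    """Toy matcher: handles only a single '.class' selector."""
    wanted = selector.lstrip(".")
    return [el for el in root.iter() if wanted in el.get("class", "").split()]

def extract(html, schema):
    """Build one dict per element matched by baseSelector."""
    root = ET.fromstring(html.strip())
    rows = []
    for block in by_class(root, schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            matches = by_class(block, field["selector"])
            # Fields are resolved relative to the current block, not the page.
            row[field["name"]] = "".join(matches[0].itertext()).strip() if matches else None
        rows.append(row)
    return rows

records = extract(html, schema)
```

Here `records` holds one dictionary per `.product` block, each with a `title` and a `price` key.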
4. Supported Field Types: Text, Attribute, HTML, Regex
- Field Type Options:
  - `text`: Extracts visible text.
  - `attribute`: Captures an HTML attribute (e.g., `src`, `href`).
  - `html`: Extracts the raw HTML of an element.
  - `regex`: Allows regex patterns to extract part of the text.
- Example HTML (including an image):
- Schema:
  ```python
  schema = {
      "baseSelector": ".product",
      "fields": [
          {"name": "title", "selector": ".title", "type": "text"},
          {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
          {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
          {"name": "description_html", "selector": ".description", "type": "html"}
      ]
  }
  ```
- Explanation:
  - `attribute`: Extracts the `src` attribute from `.product-image`.
  - `regex`: Extracts the numeric part from `$19.99`.
  - `html`: Retrieves the full HTML of the description element.
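To make the four field types concrete, here is a small standard-library sketch that approximates what each one yields for a single product block. The markup is a made-up stand-in, `first` is a hypothetical helper, and the `regex` case assumes the first capture group is what gets returned. Crawl4AI's actual output, especially the exact HTML serialization, may differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup, kept well-formed so the stdlib XML parser can read it.
html = """
<div class="product">
  <h2 class="title">Sample Product</h2>
  <img class="product-image" src="product.jpg" />
  <span class="price">$19.99</span>
  <p class="description">A <b>sample</b> product.</p>
</div>
"""

product = ET.fromstring(html.strip())

def first(el, css_class):
    """Toy lookup: first descendant whose class list contains css_class."""
    return next(n for n in el.iter() if css_class in n.get("class", "").split())

# type "text": the element's visible text
title = "".join(first(product, "title").itertext()).strip()

# type "attribute": the value of the named attribute
image_url = first(product, "product-image").get("src")

# type "regex": the pattern applied to the element's text (first capture group assumed)
price_text = "".join(first(product, "price").itertext())
price = re.search(r"\$(\d+\.\d+)", price_text).group(1)

# type "html": the element's markup, inner tags included
description_html = ET.tostring(first(product, "description"), encoding="unicode").strip()
```

With this input, `price` is the string `"19.99"` (no dollar sign), and `description_html` keeps the inner `<b>` tag that a `text` field would drop.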
5. Nested Field Extraction
- Use Case: Content that contains sub-elements, such as an article with author details inside it.
- Example HTML:
- Schema:
- Explanation:
  - `nested`: Extracts `name` and `bio` within `.author`, grouping the author details in a single `author` object.
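A sketch of how a nested field could look and what shape it produces. The article markup, the `.article`, `.name`, and `.bio` selectors, and the `find` and `extract_fields` helpers are all assumptions made for illustration. The helpers are a toy stand-in for the real strategy and handle only single-class selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical article block with author details inside it.
html = """
<div class="article">
  <h1 class="title">Sample Article</h1>
  <div class="author">
    <span class="name">John Doe</span>
    <p class="bio">Writer and editor.</p>
  </div>
</div>
"""

schema = {
    "baseSelector": ".article",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "author", "selector": ".author", "type": "nested", "fields": [
            {"name": "name", "selector": ".name", "type": "text"},
            {"name": "bio", "selector": ".bio", "type": "text"},
        ]},
    ],
}

def find(el, selector):
    """Toy matcher: first descendant carrying the single class in '.class'."""
    wanted = selector.lstrip(".")
    for node in el.iter():
        if wanted in node.get("class", "").split():
            return node
    return None

def extract_fields(el, fields):
    out = {}
    for field in fields:
        target = find(el, field["selector"])
        if target is None:
            out[field["name"]] = None
        elif field["type"] == "nested":
            # Recurse: sub-fields are resolved relative to the nested element.
            out[field["name"]] = extract_fields(target, field["fields"])
        else:  # "text"
            out[field["name"]] = "".join(target.itertext()).strip()
    return out

# The parsed root is the .article block itself in this small example.
article = extract_fields(ET.fromstring(html.strip()), schema["fields"])
```

The result keeps `title` at the top level and groups `name` and `bio` under a single `author` key.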
6. List and Nested List Extraction
- List: Extracts multiple elements matching the selector as a list.
- Nested List: Allows lists within lists, useful for items with sub-lists (e.g., specifications for each product).
- Example HTML:
- Schema:
- Explanation:
  - `list`: Captures each `.feature` item within `.features`, outputting an array of features under the `features` field.
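A sketch of a `list` field, reusing the `.features .feature` selector that appears in the full example of this outline. The markup is illustrative, and `select` and `text_of` are hypothetical helpers: a toy matcher that follows chains of single classes, under the assumption that a sub-field without its own selector reads the matched element itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical product block with a feature list.
html = """
<div class="product">
  <h2 class="title">Product with Features</h2>
  <ul class="features">
    <li class="feature">Durable</li>
    <li class="feature">Eco-friendly</li>
  </ul>
</div>
"""

schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text"},
        {"name": "features", "type": "list", "selector": ".features .feature",
         "fields": [{"name": "feature", "type": "text"}]},
    ],
}

def select(el, selector):
    """Toy matcher: descendant chains of single classes, e.g. '.features .feature'."""
    current = [el]
    for part in selector.split():
        wanted = part.lstrip(".")
        current = [node for parent in current for node in parent.iter()
                   if node is not parent and wanted in node.get("class", "").split()]
    return current

def text_of(el):
    return "".join(el.itertext()).strip()

product = ET.fromstring(html.strip())
record = {}
for field in schema["fields"]:
    matches = select(product, field["selector"])
    if field["type"] == "list":
        # One dict per matching element, in document order.
        record[field["name"]] = [
            {sub["name"]: text_of(m) for sub in field["fields"]} for m in matches
        ]
    else:
        record[field["name"]] = text_of(matches[0]) if matches else None
```

In this sketch `record["features"]` is a list with one entry per `<li class="feature">`, in document order.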
7. Transformations for Field Values
- Transformations allow you to modify extracted values (e.g., converting to lowercase).
- Supported transformations: `lowercase`, `uppercase`, `strip`.
- Example HTML:
- Schema:
- Explanation: The `transform` property changes the `title` to uppercase, useful for standardized outputs.
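A sketch of a field with a `transform`, together with a toy lookup table showing what the three names plausibly do. The schema values and the `TRANSFORMS` mapping are assumptions for illustration. The mapping treats each name as the matching Python string method, which may not be exactly how Crawl4AI applies them.

```python
# Hypothetical schema: the title is extracted as text, then uppercased.
schema = {
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"},
    ],
}

# Assumed meaning of each supported transformation name.
TRANSFORMS = {
    "lowercase": str.lower,
    "uppercase": str.upper,
    "strip": str.strip,
}

raw_title = "Sample Product"  # stand-in for the text matched by ".title"
field = schema["fields"][0]
title = TRANSFORMS[field["transform"]](raw_title)
```

The transformation is applied after extraction, so the selector and field type stay the same whether or not a `transform` is present.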
8. Full JSON-CSS Extraction Example
- Combining all elements in a single schema example for a comprehensive crawl:
- Example HTML:
  ```html
  <div class="product">
    <h2 class="title">Featured Product</h2>
    <img class="product-image" src="product.jpg">
    <span class="price">$99.99</span>
    <p class="description">Best product of the year.</p>
    <ul class="features">
      <li class="feature">Durable</li>
      <li class="feature">Eco-friendly</li>
    </ul>
  </div>
  ```
- Schema:
  ```python
  schema = {
      "baseSelector": ".product",
      "fields": [
          {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"},
          {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
          {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
          {"name": "description", "selector": ".description", "type": "html"},
          {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
              {"name": "feature", "type": "text"}
          ]}
      ]
  }
  ```
- Explanation: This schema captures and transforms each aspect of the product, illustrating the JSON-CSS strategy’s versatility for structured extraction.
9. Wrap Up & Next Steps
- Summarize JSON-CSS Extraction’s flexibility for structured, pattern-based extraction.
- Tease the next video: 10.2 LLM Extraction Strategy, focusing on using language models to extract data based on intelligent content analysis.
This outline covers each JSON-CSS Extraction option in Crawl4AI, with practical examples and schema configurations, making it a thorough guide for users.