How to Extract Structured Data with AI
The quality of structured data extraction depends primarily on how well you prepare the extraction task. A well-designed schema with clear field descriptions consistently outperforms a hastily defined one, even when using the same LLM and the same target pages.
Analyze Your Target Pages
Before writing any schema, spend time examining 10 to 20 pages from your target site. Open each page in a browser and identify every piece of data you might want to extract. Note how the data is presented: whether the price is shown with a currency symbol, whether dates are relative or absolute, and whether specifications sit in a table or are scattered through the description text.
Pay attention to variation between pages. Some product pages might have ratings while others do not. Some might show inventory counts while others just show "in stock" or "out of stock." Some might have multiple prices (original, sale, member). Understanding this variation is essential for designing a schema that handles all cases correctly.
Document edge cases: pages with missing data, pages with unusual formatting, pages in different categories that use different templates. These edge cases are where extraction most commonly fails, and knowing about them upfront lets you design your schema and validation to handle them gracefully.
Design Your JSON Schema
Build your schema with three principles: be specific about what each field means, use appropriate data types, and only include fields you actually need. A schema with five well-defined fields produces better results than one with twenty loosely defined fields.
For each field, write a description that eliminates ambiguity. "Price" is ambiguous on a page showing original price, sale price, and member price. "The current sale price shown to non-member customers, as a decimal number excluding tax and shipping" is unambiguous. The more specific your descriptions, the more consistent your extraction results will be.
Use appropriate types: numbers for prices and quantities, booleans for binary states like availability, ISO 8601 strings for dates, arrays for repeated elements like product variants or review lists. Type specification enables your validation layer to catch extraction errors automatically.
Mark fields as required only when they should always be present on every page you scrape. Fields that might be missing on some pages should be optional. Marking optional fields as required causes unnecessary validation failures and obscures real extraction problems.
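As an illustration of these principles, here is a minimal sketch of a product schema written as a Python dictionary in JSON Schema style. The field names (name, sale_price, in_stock, release_date, variants, rating) are hypothetical and chosen only for this example; adapt them to whatever your target pages actually contain.

```python
import json

# Hypothetical product schema; field names are illustrative assumptions.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "The product title as shown in the main heading of the page.",
        },
        "sale_price": {
            "type": "number",
            "description": (
                "The current sale price shown to non-member customers, "
                "as a decimal number excluding tax and shipping."
            ),
        },
        "in_stock": {
            "type": "boolean",
            "description": "true if the page says the product can be ordered now, otherwise false.",
        },
        "release_date": {
            "type": "string",
            "format": "date",
            "description": "Release date as an ISO 8601 date, for example 2024-03-01.",
        },
        "variants": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Names of the selectable product variants, one entry per variant.",
        },
        "rating": {
            "type": "number",
            "description": "Average customer rating as shown on the page. Omit if no rating is shown.",
        },
    },
    # Only fields expected on every page are required; the rest are optional.
    "required": ["name", "sale_price", "in_stock"],
}

if __name__ == "__main__":
    print(json.dumps(PRODUCT_SCHEMA, indent=2))
```

Keeping rating, release_date, and variants out of the required list means a page without them still validates, so validation failures point at real extraction problems.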
Preprocess Page Content
Raw HTML from the rendering stage contains enormous amounts of irrelevant content: navigation menus, footers, sidebar widgets, ad scripts, analytics code, and deeply nested layout divs. Converting to clean markdown before extraction can reduce token consumption by roughly 60 to 80 percent, depending on how much boilerplate the site carries, and improves extraction accuracy by removing distracting content.
Use main content extraction to isolate the primary content area from boilerplate. Libraries like Mozilla Readability identify the main content block and strip everything else. This further reduces token usage and prevents the model from accidentally extracting data from navigation links or footer content.
Preserve meaningful structure during cleaning. Tables should remain as tables (in markdown format) because the row and column relationships are important for extraction. Lists should preserve their hierarchy. Links can be preserved or stripped depending on whether URL extraction is part of your schema.
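The cleaning step can be sketched with Python's standard library alone. This is a simplified stand-in for a real main-content extractor such as Readability, not a replacement for one: it drops a fixed, assumed set of boilerplate tags and keeps each table row as a pipe-delimited line so that cell relationships survive. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch; the set is an assumption.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "aside"}


class ContentCleaner(HTMLParser):
    """Collects visible text, skipping boilerplate and keeping table rows together."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate tag
        self.lines = []       # cleaned output lines
        self.row = None       # cells of the table row being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "tr" and not self.skip_depth:
            self.row = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "tr" and self.row is not None:
            self.lines.append("| " + " | ".join(self.row) + " |")
            self.row = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self.skip_depth:
            return
        if self.row is not None:
            self.row.append(text)      # text inside a table row becomes a cell
        else:
            self.lines.append(text)


def clean_html(html: str) -> str:
    parser = ContentCleaner()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

A limitation worth knowing: a cell that contains nested inline tags is split into several cells here, and no markdown header separator row is emitted. A production pipeline would normally rely on a dedicated HTML-to-markdown converter for those details.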
Run Extraction and Validate
Send the cleaned content to your LLM along with the extraction schema. Use a low temperature setting (0.0 to 0.2) for maximum consistency. If your tool supports it, use structured output mode or function calling to ensure the model returns valid JSON rather than free-form text with embedded JSON.
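When structured output mode is not available, the request and response handling can be approximated as below. This is a provider-neutral sketch: it only builds the prompt text and strictly parses the reply, and it deliberately contains no API call, since client libraries differ. The function names build_prompt and parse_reply are hypothetical.

```python
import json


def build_prompt(schema: dict, page_markdown: str) -> str:
    """Combine the extraction schema and the cleaned page into one prompt."""
    return (
        "Extract the fields defined by this JSON Schema from the page content.\n"
        "Return one JSON object only, with no surrounding text. "
        "Omit optional fields that are not present on the page.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Page content:\n{page_markdown}"
    )


def parse_reply(reply: str) -> dict:
    """Accept only a reply that is a single JSON object."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass)
    # when the reply is free-form text rather than pure JSON.
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

Rejecting replies that wrap the JSON in prose, instead of trying to dig the JSON out with a regular expression, keeps failures visible so they can go through the retry path.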
Validate every extraction result before passing it downstream. Check that required fields are present and non-empty. Verify that data types match the schema, converting strings to numbers or dates as needed. Apply range checks for numerical fields, such as rejecting a negative price. Run cross-field consistency checks where applicable; for example, a sale price should not be higher than the original price on the same page.
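A minimal sketch of such a validation layer, using only the standard library, might look like this. The field names, the REQUIRED list, and the MAX_PRICE bound are assumptions made for the example, not values from any particular site.

```python
from datetime import date

REQUIRED = ("name", "sale_price", "in_stock")  # assumed required fields
MAX_PRICE = 1_000_000                          # assumed sanity bound


def validate_product(record: dict) -> tuple[dict, list[str]]:
    """Return a normalized copy of the record and a list of validation errors."""
    cleaned = dict(record)
    errors: list[str] = []

    # Required fields must be present and non-empty.
    for field in REQUIRED:
        if cleaned.get(field) is None or cleaned.get(field) == "":
            errors.append(f"{field}: missing or empty")

    # Type conversion and range checks for price fields.
    for field in ("sale_price", "original_price"):
        value = cleaned.get(field)
        if value is None or value == "":
            continue
        if isinstance(value, str):
            try:
                value = float(value.replace(",", "").lstrip("$"))
            except ValueError:
                errors.append(f"{field}: not a number: {value!r}")
                continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number")
            continue
        if not 0 < value < MAX_PRICE:
            errors.append(f"{field}: out of range: {value}")
        cleaned[field] = float(value)

    # Booleans must be real booleans, not text such as "yes".
    stock = cleaned.get("in_stock")
    if stock is not None and stock != "" and not isinstance(stock, bool):
        errors.append("in_stock: expected true or false")

    # Dates must parse as ISO 8601 calendar dates.
    if cleaned.get("release_date") is not None:
        try:
            date.fromisoformat(cleaned["release_date"])
        except (ValueError, TypeError):
            errors.append("release_date: not an ISO 8601 date")

    # Cross-field check: the sale price should not exceed the original price.
    sale, original = cleaned.get("sale_price"), cleaned.get("original_price")
    if isinstance(sale, float) and isinstance(original, float) and sale > original:
        errors.append("sale_price: higher than original_price")

    return cleaned, errors
```

Returning every error instead of stopping at the first one makes it easier to see later which fields fail most often.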
For results that fail validation, implement a retry strategy. The simplest approach retries with the same prompt and model, relying on LLM non-determinism to potentially produce a valid result. This only helps when sampling is not fully deterministic; at a temperature of 0.0, an identical retry will usually return the same output. A more sophisticated approach retries with a more detailed prompt or a more capable model. After exhausting retries, flag the page for manual review rather than accepting invalid data.
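The escalating retry strategy can be sketched as a loop over a list of extractor functions, ordered from cheapest to most capable. Each extractor here is a hypothetical callable that takes the page text and returns a dictionary; how it calls a model is left out on purpose. The is_valid callback stands in for whatever validation you run.

```python
from typing import Callable

Extractor = Callable[[str], dict]


def extract_with_retries(
    page_text: str,
    extractors: list[Extractor],
    is_valid: Callable[[dict], bool],
) -> dict:
    """Try each extractor in order; flag the page for review if none succeeds."""
    for attempt, extract in enumerate(extractors, start=1):
        try:
            result = extract(page_text)
        except Exception:
            # In this sketch any failure (network error, unparseable reply)
            # simply moves on to the next, more capable attempt.
            continue
        if is_valid(result):
            return {"status": "ok", "attempts": attempt, "data": result}
    # Never pass invalid data downstream; hand the page to a human instead.
    return {"status": "needs_review", "attempts": len(extractors), "data": None}
```

A usage example with stand-in extractors: passing [cheap_model, cheap_model_detailed_prompt, large_model] gives one plain attempt, one attempt with a more detailed prompt, and a final attempt on a more capable model before the page is flagged.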
Iterate and Optimize
After running extraction on your full set of target pages, analyze the results for patterns. Which fields have the highest error rates? Which pages produce validation failures? Are there consistent formatting issues that a post-processing rule could fix?
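One way to answer the first of these questions is to tally, per field, the share of pages on which that field failed validation. The sketch below assumes you have recorded, for each page, the list of field names that failed; that input format and the function name are assumptions for the example.

```python
from collections import Counter


def field_error_rates(failed_fields_per_page: list[list[str]]) -> dict[str, float]:
    """Map each field to the fraction of pages on which it failed, worst first."""
    total = len(failed_fields_per_page)
    if total == 0:
        return {}
    # set() ensures a field is counted at most once per page.
    counts = Counter(
        field for failed in failed_fields_per_page for field in set(failed)
    )
    return {field: count / total for field, count in counts.most_common()}
```

Fields at the top of this ranking are the first candidates for a more specific description or a post-processing rule.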
Refine field descriptions for fields with high error rates. If the model consistently extracts the wrong price, make the description more specific. If dates come back in inconsistent formats, add format examples to the description. If a boolean field returns text values, clarify in the description that you want true or false specifically.
Optimize cost by testing whether a smaller, cheaper model produces acceptable accuracy for your specific extraction task. Many straightforward extractions work well with smaller models, reserving larger models for complex pages or retry attempts. Track accuracy by model to find the most cost-effective configuration for each extraction task.
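Once accuracy has been measured per model on a labeled sample, picking the most cost-effective configuration is a small selection problem. The sketch below assumes each candidate is described by a dictionary with name, accuracy, and cost_per_page keys; the keys, the model names, and the numbers in the usage note are illustrative placeholders, not measurements.

```python
from typing import Optional


def cheapest_acceptable(models: list[dict], min_accuracy: float) -> Optional[dict]:
    """Return the lowest-cost model that meets the accuracy target, or None."""
    acceptable = [m for m in models if m["accuracy"] >= min_accuracy]
    return min(acceptable, key=lambda m: m["cost_per_page"], default=None)
```

For example, given placeholder entries {"name": "small", "accuracy": 0.96, "cost_per_page": 0.001} and {"name": "large", "accuracy": 0.99, "cost_per_page": 0.01}, a target of 0.95 selects the small model, while a target of 0.98 selects the large one. Running this per extraction task, rather than once globally, matches the advice to track accuracy by model for each task.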
Structured data extraction quality depends on schema design more than model capability. Invest time in analyzing target pages, writing specific field descriptions, and building thorough validation. Iterate on the schema based on real extraction results to continuously improve accuracy.