Skip to content

feat(): Introduce domain specific parsers#605

Open
Dunky13 wants to merge 5 commits intocmintey:mainfrom
Dunky13:feat/domain-specific-rules
Open

feat(): Introduce domain specific parsers#605
Dunky13 wants to merge 5 commits intocmintey:mainfrom
Dunky13:feat/domain-specific-rules

Conversation

@Dunky13
Copy link
Copy Markdown
Contributor

@Dunky13 Dunky13 commented Jan 8, 2026

As discussed here: #594
I suggested a domain specific parser. Yes, hm.com gave access denied, due to akamai protection, but it did load a few times during testing.

As you can see it can easily be extended by the community with domain specific parsing. I've tried to keep is as simple and reproducable as possible, whilst keeping the original parser intact.

Also introduced typings that are in line with the metascraper package, since Check wasn't exposed anymore in the installed version

@Dunky13 Dunky13 marked this pull request as draft January 8, 2026 21:04
@Dunky13 Dunky13 marked this pull request as ready for review January 9, 2026 08:59
@cmintey
Copy link
Copy Markdown
Owner

cmintey commented Jan 12, 2026

Thanks so much for fixing the types!! I haven't thoroughly reviewed yet, but my first thought (and what I was getting at in the issue discussion) is that H&M and Ikea are not "custom". I mean, there might be other sites that benefit from those same rules considering you're just extracting from the Json-LD, so I think those should just be part of the default rule set. bol.com seems to be mostly custom, at least by using the variant id, so that one makes sense to me to be a truly custom rule set. Other than that, again at first glance, I like the domain framework that you've set up. I'll do a thorough review when I have some more time

Copy link
Copy Markdown
Owner

@cmintey cmintey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, thank you for your work on this and cleaning things up! I have a few comments I'd like to see addressed.

Comment on lines +42 to +43
const name = jsonld ? getProperty(jsonld, "name") : undefined;
return typeof name === "string" ? name : undefined;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You use this pattern a lot and you've actually already created a helper in domain-helpers, so you should use that instead of repeating this code a bunch

Comment on lines +177 to +178
const formatted = toPriceFormat($('[property="og:price:amount"]').attr("content"));
return formatted !== undefined ? String(formatted) : undefined;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be better to move this logic into the toPriceFormat method to cast to a string

* Extract price from JSON-LD offers with proper formatting.
*/
export const extractPrice = (data: unknown): string | undefined => {
return extractOffersField(data, "price");
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should make use of toPriceFormat

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup; will fix

*/
const domains = ["ikea.com"];

export const ikeaComRules: ShoppingDomainRules = rulesForDomain(domains, {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does Ikea need a custom parser? This site works for me with the base parser. Also, the rules here are not unique to Ikea and are basically the same as the base parser

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had pages that didn't load for me - where the current parser did not pick up meta data.

*/
const domains = ["hm.com"];

export const hmComRules: ShoppingDomainRules = rulesForDomain(domains, {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be a custom parser. There are likely several sites that also use this ProductGroup schema with a variant. I'd prefer these rules just be added to the base parser

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons why I considered it as a separate app is due to performance. For every rule you add to the "base parser", the more it needs to process, even if the page doesn't contain such an element. It still needs to be parsed (albeit memoized, but still) - where it could've been skipped alltogether..

Also, I used hm, ikea & bol as examples. Happy to omit one or more, if that has a preference

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the performance hit is very large. I'd rather the parsing take just a few ms longer to have a robust set of standard rules and then only have a few custom rules for outliers. Like I think it makes sense for Amazon to have it's own parser, since as you pointed out, there are several Amazon-only rules in the base parser

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants