PHP package for parsing website metadata, such as titles, favicons, OpenGraph tags, and more.

wischerdson/html-meta

HTML Meta is a PHP package for parsing website metadata, such as titles, favicons, OpenGraph tags and others.


Installation

To install the package via Composer, run:

composer require osmuhin/html-meta

Note

Ensure that the vendor/autoload.php file is required in your code to enable the autoloading mechanism provided by Composer.

Basic usage

Parsing Metadata from URL

use Osmuhin\HtmlMeta\Crawler;

$meta = Crawler::init(url: 'https://google.com')->run();

echo $meta->title; // Google

Parsing Metadata from Raw HTML

Instead of a URL, you can parse metadata from raw HTML by passing it as a string:

$html = <<<END
<html lang="en">
    <head>
        <title>Google</title>
        <meta charset="UTF-8">
        <link rel="icon" href="/favicon.ico">
    </head>
</html>
END;

$meta = Crawler::init(html: $html, url: 'https://google.com')->run();

$icon = $meta->favicon->icons[0];

echo $icon->url; // https://google.com/favicon.ico

Always pass the url parameter when using raw HTML to resolve relative paths correctly.
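
To illustrate why the base URL matters, here is a minimal sketch of resolving a relative path against a base URL in plain PHP. The resolveUrl() function is a hypothetical helper written for this example, not part of the package, and the package's own resolution logic may differ. It does not normalize ../ segments and ignores query strings in the base URL.

```php
<?php
// Hypothetical helper for illustration only; not the package's API.
function resolveUrl(string $base, string $relative): string
{
    // A URL that already has a scheme is absolute: return it unchanged.
    if (parse_url($relative, PHP_URL_SCHEME) !== null) {
        return $relative;
    }

    $parts = parse_url($base);
    $scheme = $parts['scheme'] ?? 'https';
    $host = $parts['host'] ?? '';
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . $host . $port;

    // Protocol-relative URL: reuse the scheme of the base URL.
    if (str_starts_with($relative, '//')) {
        return $scheme . ':' . $relative;
    }

    // Root-relative path: append it to the origin.
    if (str_starts_with($relative, '/')) {
        return $origin . $relative;
    }

    // Path-relative: resolve against the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, (int) strrpos($path, '/') + 1);

    return $origin . $dir . $relative;
}

$favicon = resolveUrl('https://google.com', '/favicon.ico');
```

Without a base URL there is no origin to prepend, which is why a relative href such as /favicon.ico cannot be turned into an absolute URL from raw HTML alone.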

Using a Custom Request Object

Under the hood, the Guzzle HTTP library is used to fetch the HTML, so you can create your own request object and pass it as the request parameter:

$request = new \GuzzleHttp\Psr7\Request('GET', 'https://google.com');

$meta = Crawler::init(request: $request)->run();

All properties of the meta object are described here.

Configuration

You can customize the crawler’s behavior using its configuration methods:

$crawler = Crawler::init(url: 'https://google.com');
$crawler->config
    ->dontProcessUrls()
    ->dontUseTypeConversions()
    ->processUrlsWith('https://yandex.ru')
    ->dontUseDefaultDistributorsConfiguration();

Settings:

  • dontProcessUrls(): Disables the conversion of relative URLs to absolute URLs.

  • dontUseTypeConversions(): Disables automatic type conversions (e.g., string to int):

      <meta property="og:image:height" content="630">
      Using type conversions: int(630)
      Disabled type conversions: string(3) "630"

      <meta property="og:image:height" content="630.5">
      Using type conversions: null
      Disabled type conversions: string(5) "630.5"

  • processUrlsWith(string $url): Sets a base URL for resolving relative paths (automatically enables URL processing).

  • dontUseDefaultDistributorsConfiguration(): Disables the default distributor configuration.
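
The type-conversion results shown for og:image:height can be reproduced in plain PHP with filter_var(). This is only a sketch of the documented behavior: toIntOrNull() is a hypothetical name, and the package may implement its conversions differently.

```php
<?php
// Hypothetical helper for illustration only; not the package's API.
// Returns an int when the string is a valid integer, otherwise null.
function toIntOrNull(string $content): ?int
{
    return filter_var($content, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);
}

$height = toIntOrNull('630');    // int(630)
$invalid = toIntOrNull('630.5'); // null, because "630.5" is not an integer
```

Returning null rather than a truncated number means an invalid value is never mistaken for a real one.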

Core concepts

The Crawler object

The main interaction happens through the $crawler object of type \Osmuhin\HtmlMeta\Crawler.

  1. Initialization: Configure the crawler before calling run().

  2. Execution: After run() is called, the crawler performs the following steps:

    • fetches the HTML string from the URL (if raw HTML is not provided).
      If more than one source parameter is passed, they take priority in the following order: string $html, then \GuzzleHttp\Psr7\Request $request, then string $url;

    • parses the HTML using the configured xpath:

      $crawler->xpath = '//html|//html/head/link|//html/head/meta|//html/head/title';

      You are free to overwrite the xpath property;

    • passes each found HTML element to the distributor stack.
      If the element satisfies a distributor's conditions, its value is written to a DTO (Data Transfer Object) of the type \Osmuhin\HtmlMeta\Contracts\Dto;

    • after the HTML string has been parsed, forms the root DTO \Osmuhin\HtmlMeta\Dto\Meta as the output.
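
To see which elements the default xpath expression selects, the query can be run with PHP's built-in DOM extension. This is a sketch only: it assumes nothing about how the package evaluates the expression internally, and the sample HTML is the one from the raw HTML example.

```php
<?php
$html = <<<'HTML'
<html lang="en">
    <head>
        <title>Google</title>
        <meta charset="UTF-8">
        <link rel="icon" href="/favicon.ico">
    </head>
</html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$expression = '//html|//html/head/link|//html/head/meta|//html/head/title';

// Gather the tag name of every node matched by the expression.
$names = [];
foreach ($xpath->query($expression) as $node) {
    $names[] = $node->nodeName;
}
```

For this document the expression matches four elements: html, title, meta, and link. Narrowing the expression (for example to //html/head/title) reduces the set of elements that reach the distributors.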

Distributors

A Distributor validates HTML elements and distributes their data into DTOs.

A distributor must implement the \Osmuhin\HtmlMeta\Contracts\Distributor interface, which has two main methods:

public function canHandle(): bool
{

}

public function handle(): void
{

}

canHandle() - Checks whether the distributor can handle the current element. If it returns true, all sub-distributors are polled, and then the handle() method is called.

handle() - Distributes the HTML element's data into DTOs according to its own rules.
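
The polling order described above can be sketched with a few stand-in classes. Everything here is hypothetical (SketchElement, SketchDistributor, poll(), and the method parameters are invented for this example); the package's real distributors take no arguments in canHandle() and handle() and are wired up differently.

```php
<?php
// Hypothetical stand-ins for illustration only; not the package's classes.
final class SketchElement
{
    public function __construct(
        public string $name,
        public string $innerText = '',
    ) {}
}

abstract class SketchDistributor
{
    /** @var SketchDistributor[] */
    private array $subDistributors = [];

    public function useSubDistributors(SketchDistributor ...$subDistributors): static
    {
        $this->subDistributors = $subDistributors;

        return $this;
    }

    abstract public function canHandle(SketchElement $el): bool;

    abstract public function handle(SketchElement $el, stdClass $meta): void;

    // If this distributor accepts the element, poll its sub-distributors
    // first, then call its own handle() method.
    public function poll(SketchElement $el, stdClass $meta): void
    {
        if (!$this->canHandle($el)) {
            return;
        }

        foreach ($this->subDistributors as $sub) {
            $sub->poll($el, $meta);
        }

        $this->handle($el, $meta);
    }
}

final class SketchRoot extends SketchDistributor
{
    public function canHandle(SketchElement $el): bool
    {
        return true; // the root accepts every element
    }

    public function handle(SketchElement $el, stdClass $meta): void
    {
        // nothing to write at the root level
    }
}

final class SketchTitle extends SketchDistributor
{
    public function canHandle(SketchElement $el): bool
    {
        return $el->name === 'title';
    }

    public function handle(SketchElement $el, stdClass $meta): void
    {
        $meta->title = $el->innerText;
    }
}

$meta = new stdClass();
$root = (new SketchRoot())->useSubDistributors(new SketchTitle());

$root->poll(new SketchElement('title', 'Google'), $meta); // sets $meta->title
$root->poll(new SketchElement('link'), $meta);            // ignored by SketchTitle
```

An element rejected by a distributor never reaches that distributor's sub-distributors, which keeps each branch of the tree focused on one kind of tag.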

The simplest distributor, TitleDistributor, has the following structure:

class TitleDistributor extends \Osmuhin\HtmlMeta\Distributors\AbstractDistributor
{
    public function canHandle(): bool
    {
        return $this->el->name === 'title';
    }

    public function handle(): void
    {
        $this->meta->title = $this->el->innerText;
    }
}

You are free to replace any distributor with your own, for example:

use Osmuhin\HtmlMeta\Distributors\TitleDistributor;

class MyCustomTitleDistributor extends TitleDistributor
{
    public function handle(): void
    {
        $this->meta->title = 'Prefix for title ' . $this->el->innerText;
    }
}

Then replace the original TitleDistributor in the initial configuration:

$crawler = Crawler::init(url: 'https://google.com');
$crawler->distributor->setSubDistributor(
    MyCustomTitleDistributor::class,
    TitleDistributor::class
);

$meta = $crawler->run();
$meta->title === 'Prefix for title Google';

... or even overwrite the distributor tree completely:

$crawler = Crawler::init(url: 'https://google.com');
$crawler->xpath = '//html/head/title';
$crawler->config->dontUseDefaultDistributorsConfiguration();

$crawler->distributor->useSubDistributors(
    MyCustomTitleDistributor::init($crawler->container)
);

$meta = $crawler->run();

Default distributors configuration

$crawler->distributor->useSubDistributors(
    \Osmuhin\HtmlMeta\Distributors\HtmlDistributor::init(),
    \Osmuhin\HtmlMeta\Distributors\TitleDistributor::init(),
    \Osmuhin\HtmlMeta\Distributors\MetaDistributor::init()->useSubDistributors(
        \Osmuhin\HtmlMeta\Distributors\HttpEquivDistributor::init(),
        \Osmuhin\HtmlMeta\Distributors\TwitterDistributor::init(),
        \Osmuhin\HtmlMeta\Distributors\OpenGraphDistributor::init()
    ),
    \Osmuhin\HtmlMeta\Distributors\LinkDistributor::init()->useSubDistributors(
        \Osmuhin\HtmlMeta\Distributors\LinkRelDistributor::init()->useSubDistributors(
            \Osmuhin\HtmlMeta\Distributors\FaviconDistributor::init()
        )
    )
);

Contributing

Thank you for considering contributing to this package! Please refer to the Contributing Guidelines for more details.

You can contact me or just come say hi in Telegram: @wischerdson

License

This package is open-sourced software licensed under the MIT license.
