HTML Meta is a PHP package for parsing website metadata, such as titles, favicons, OpenGraph tags and others.
To install the package via Composer, run:
composer require osmuhin/html-meta
Note
Ensure that the vendor/autoload.php file is required in your code to enable the autoloading mechanism provided by Composer.
use Osmuhin\HtmlMeta\Crawler;
$meta = Crawler::init(url: 'https://google.com')->run();
echo $meta->title; // Google
Instead of a URL, you can parse metadata from raw HTML by passing it as a string:
$html = <<<END
<html lang="en">
<head>
<title>Google</title>
<meta charset="UTF-8">
<link rel="icon" href="/favicon.ico">
</head>
</html>
END;
$meta = Crawler::init(html: $html, url: 'https://google.com')->run();
$icon = $meta->favicon->icons[0];
echo $icon->url; // https://google.com/favicon.ico
Always pass the url parameter when using raw HTML so that relative paths are resolved correctly.
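To illustrate why the base URL matters, here is a minimal sketch of resolving a relative path against a base URL in plain PHP. The function name resolveAgainstBase is hypothetical and is not part of this package; the package's own resolution logic may differ, and this sketch deliberately ignores dot segments such as "../".

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of html-meta): resolves $path against $baseUrl.
 * Covers only the common cases; it is not a full RFC 3986 resolver.
 */
function resolveAgainstBase(string $path, string $baseUrl): string
{
    // Already absolute (has a scheme): return unchanged.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $path) === 1) {
        return $path;
    }

    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        throw new InvalidArgumentException('Base URL must be absolute.');
    }

    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative path, e.g. //cdn.example.com/icon.png
    if (str_starts_with($path, '//')) {
        return $base['scheme'] . ':' . $path;
    }

    // Root-relative path, e.g. /favicon.ico
    if (str_starts_with($path, '/')) {
        return $origin . $path;
    }

    // Document-relative path: resolve against the directory of the base path.
    $basePath = $base['path'] ?? '/';
    $dir = substr($basePath, 0, (int) strrpos($basePath, '/') + 1);

    return $origin . $dir . $path;
}

echo resolveAgainstBase('/favicon.ico', 'https://google.com'), PHP_EOL;
// https://google.com/favicon.ico
```

Without a base URL, a value such as /favicon.ico cannot be turned into an absolute address, which is why the url parameter should accompany raw HTML.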
Under the hood, the GuzzleHttp library is used to fetch the HTML, so you can create your own request object and pass it as the request parameter:
$request = new \GuzzleHttp\Psr7\Request('GET', 'https://google.com');
$meta = Crawler::init(request: $request)->run();
All properties of the meta object are described here.
You can customize the crawler’s behavior using its configuration methods:
$crawler = Crawler::init(url: 'https://google.com');
$crawler->config
->dontProcessUrls()
->dontUseTypeConversions()
->processUrlsWith('https://yandex.ru')
->dontUseDefaultDistributorsConfiguration();
Setting | Description |
---|---|
dontProcessUrls() | Disables the conversion of relative URLs to absolute URLs. |
dontUseTypeConversions() | Disables automatic type conversions (e.g., string to int). For <meta property="og:image:height" content="630">: int(630) with type conversions, string(3) "630" without. For <meta property="og:image:height" content="630.5">: null with type conversions, string(5) "630.5" without. |
processUrlsWith(string $url) | Sets a base URL for resolving relative paths (automatically enables URL processing). |
dontUseDefaultDistributorsConfiguration() | Disables the default distributor configuration. |
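The type-conversion behavior in the table can be reproduced with PHP's built-in filter_var(). This is only a sketch of the observable behavior; the helper name toIntOrNull is hypothetical, and the package's real conversion code may be implemented differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper mirroring the behavior shown in the table:
 * a strictly integer string becomes an int, anything else becomes null.
 */
function toIntOrNull(string $content): ?int
{
    // FILTER_NULL_ON_FAILURE makes filter_var() return null instead of false.
    return filter_var($content, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);
}

var_dump(toIntOrNull('630'));   // int(630)
var_dump(toIntOrNull('630.5')); // NULL
```

With type conversions disabled, the raw string from the content attribute would be kept as-is instead.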
The main interaction happens through the $crawler object of type \Osmuhin\HtmlMeta\Crawler.
- Initialization: configure the crawler before calling run().
- Execution: after run() is called, the crawler performs the following steps:
  - fetches the HTML string from the URL (if raw HTML is not provided). If more than one source parameter is passed, the priority is: string $html ➡ \GuzzleHttp\Psr7\Request $request ➡ string $url;
  - parses the HTML using the configured xpath:
    $crawler->xpath = '//html|//html/head/link|//html/head/meta|//html/head/title';
    You are free to overwrite the xpath property;
  - passes the parsed elements to the distributor stack;
  - if an element satisfies a distributor's conditions, its value is written to a DTO (Data Transfer Object) of type \Osmuhin\HtmlMeta\Contracts\Dto;
  - after the HTML string has been parsed, returns the root DTO \Osmuhin\HtmlMeta\Dto\Meta as the output.
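To see which elements the default xpath expression selects, the following sketch runs it with PHP's built-in DOM extension. It only demonstrates the expression itself and assumes nothing about how the crawler evaluates it internally.

```php
<?php

declare(strict_types=1);

$html = <<<'HTML'
<html lang="en">
<head>
    <title>Google</title>
    <meta charset="UTF-8">
    <link rel="icon" href="/favicon.ico">
</head>
</html>
HTML;

// Collect libxml warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);
$nodes = $xpath->query('//html|//html/head/link|//html/head/meta|//html/head/title');

// Gather the tag names of all matched elements.
$names = [];
foreach ($nodes as $node) {
    $names[] = $node->nodeName;
}

sort($names);
echo implode(', ', $names), PHP_EOL; // html, link, meta, title
```

Narrowing the expression, for example to '//html/head/title', limits the elements that reach the distributors.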
A distributor validates HTML elements and distributes their data into DTOs.
A distributor must implement the \Osmuhin\HtmlMeta\Contracts\Distributor interface, which has 2 main methods:
public function canHandle(): bool
{
}
public function handle(): void
{
}
canHandle() - Checks whether the distributor can handle the current element. If it returns true, all sub-distributors are polled, and then the handle() method is called.
handle() - Distributes the HTML element's data across DTOs according to its own rules.
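The polling order described above can be sketched in self-contained PHP. Every class here (SketchElement, SketchDistributor, SketchRoot, SketchTitle) and the dispatch() method are hypothetical illustrations, not the package's classes; in particular, the real canHandle() and handle() take no arguments.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in for a parsed HTML element. */
final class SketchElement
{
    public function __construct(
        public string $name,
        public string $innerText = '',
    ) {
    }
}

/** Hypothetical distributor; not the package's Contracts\Distributor. */
abstract class SketchDistributor
{
    /** @var SketchDistributor[] */
    private array $subDistributors = [];

    public function __construct(protected ArrayObject $meta)
    {
    }

    public function useSubDistributors(SketchDistributor ...$subs): static
    {
        $this->subDistributors = $subs;

        return $this;
    }

    abstract public function canHandle(SketchElement $el): bool;

    abstract public function handle(SketchElement $el): void;

    /** Polls the sub-distributors first, then calls handle(). */
    public function dispatch(SketchElement $el): void
    {
        if (!$this->canHandle($el)) {
            return;
        }
        foreach ($this->subDistributors as $sub) {
            $sub->dispatch($el);
        }
        $this->handle($el);
    }
}

final class SketchRoot extends SketchDistributor
{
    public function canHandle(SketchElement $el): bool
    {
        return true;
    }

    public function handle(SketchElement $el): void
    {
        // The root only delegates; it writes nothing itself.
    }
}

final class SketchTitle extends SketchDistributor
{
    public function canHandle(SketchElement $el): bool
    {
        return $el->name === 'title';
    }

    public function handle(SketchElement $el): void
    {
        $this->meta['title'] = $el->innerText;
    }
}

$meta = new ArrayObject();
$root = (new SketchRoot($meta))->useSubDistributors(new SketchTitle($meta));

$root->dispatch(new SketchElement('meta'));            // ignored by SketchTitle
$root->dispatch(new SketchElement('title', 'Google')); // written to $meta

echo $meta['title'], PHP_EOL; // Google
```

Because a parent that rejects an element returns early, its sub-distributors are never consulted for that element.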
The simplest distributor, TitleDistributor, has the following structure:
class TitleDistributor extends \Osmuhin\HtmlMeta\Distributors\AbstractDistributor
{
public function canHandle(): bool
{
return $this->el->name === 'title';
}
public function handle(): void
{
$this->meta->title = $this->el->innerText;
}
}
You are free to replace any distributor with your own, for example:
use Osmuhin\HtmlMeta\Distributors\TitleDistributor;
class MyCustomTitleDistributor extends TitleDistributor
{
public function handle(): void
{
$this->meta->title = 'Prefix for title ' . $this->el->innerText;
}
}
Then replace the original TitleDistributor in the initial configuration:
$crawler = Crawler::init(url: 'https://google.com');
$crawler->distributor->setSubDistributor(
MyCustomTitleDistributor::class,
TitleDistributor::class
);
$meta = $crawler->run();
$meta->title === 'Prefix for title Google';
... or even overwrite the distributor tree completely:
$crawler = Crawler::init(url: 'https://google.com');
$crawler->xpath = '//html/head/title';
$crawler->config->dontUseDefaultDistributorsConfiguration();
$crawler->distributor->useSubDistributors(
MyCustomTitleDistributor::init($crawler->container)
);
$meta = $crawler->run();
Default distributors configuration
$crawler->distributor->useSubDistributors(
\Osmuhin\HtmlMeta\Distributors\HtmlDistributor::init(),
\Osmuhin\HtmlMeta\Distributors\TitleDistributor::init(),
\Osmuhin\HtmlMeta\Distributors\MetaDistributor::init()->useSubDistributors(
\Osmuhin\HtmlMeta\Distributors\HttpEquivDistributor::init(),
\Osmuhin\HtmlMeta\Distributors\TwitterDistributor::init(),
\Osmuhin\HtmlMeta\Distributors\OpenGraphDistributor::init()
),
\Osmuhin\HtmlMeta\Distributors\LinkDistributor::init()->useSubDistributors(
\Osmuhin\HtmlMeta\Distributors\LinkRelDistributor::init()->useSubDistributors(
\Osmuhin\HtmlMeta\Distributors\FaviconDistributor::init()
)
)
);
Thank you for considering contributing to this package! Please refer to the Contributing Guidelines for more details.
You can contact me or just come say hi in Telegram: @wischerdson
This package is open-sourced software licensed under the MIT license.