Discovery

Indieweb utils provides a number of functions to help you determine properties from webpages.

Discover IndieWeb endpoints

The discover_endpoints() function parses HTTP Link headers and HTML <link> tags to find the specified endpoints.

indieweb_utils.discover_endpoints(url: str, headers_to_find: List[str], request: Response | None = None)[source]

Return a dictionary of specified endpoint locations for the given URL, if available.

Parameters:
  • url (str) – The URL to discover endpoints for.

  • headers_to_find (dict[str, str]) – The headers to find. Values you may want to use include: microsub, micropub, token_endpoint, authorization_endpoint, subscribe.

Returns:

The discovered endpoints.

Return type:

dict[str, str]

import indieweb_utils
import requests

url = "https://jamesg.blog/"

# find the microsub rel link on a web page
headers_to_find = ["microsub"]

endpoints = indieweb_utils.discover_endpoints(
    url
)

print(endpoints) # {'microsub': 'https://aperture.p3k.io/'}
Raises:

requests.exceptions.RequestException – Error raised while making the network request to discover endpoints.

This function only returns the specified endpoints if they can be found. It does not perform any validation to check that the discovered endpoints are valid URLs.

We recommend using the discover_webmention_endpoint function to discover webmention endpoints as this performs additional validation useful in webmention endpoint discovery.

Find an article author

You can discover the original author of an article as per the Authorship Specification.

To do so, use this function:

indieweb_utils.discover_author(url: str, html: str = '', parsed_mf2: Parser | None = None) dict[source]

Discover the author of a post per the IndieWeb Authorship specification.

Refs:

https://indieweb.org/authorship-spec

Parameters:
  • url (str) – The URL of the post.

  • page_contents (str) – The optional page contents to use. Specifying this value prevents a HTTP request being made to the URL.

Returns:

A h-card of the post.

Return type:

dict

import indieweb_utils
import mf2py

url = "https://jamesg.blog/2022/01/28/integrated-indieweb-services/"

parsed_mf2 = mf2py.parse(url=url)

post_author = indieweb_utils.discover_author(
    h_entry
)

print(post_author) # A h-card object representing the post author.

Here are the arguments you can use:

  • url: The URL of the web page whose author you want to discover.

  • page_contents: The unmodified HTML of a web page whose author you want to discover.

The page_contents argument is optional.

If no page_contents argument is specified, the URL you stated is retrieved. Then, authorship inference begins.

If you specify a page_contents value, the HTML you parsed is used for authorship discovery. This saves on a HTML request if you have already retrieved the HTML for another reason (for example, if you need to retreive other values in the page HTML). You still need to specify a URL even if you specify a page_contents value.

The discover_author function can return one of two values:

  • An author name (i.e. “James”).

  • The h-card of an author.

These are the two outputs defined in the authorship inference algorithm. Your program should be able to handle both of these outputs.

This code returns the following h-card:

{
    'type': ['h-card'],
    'properties': {
        'url': ['https://aaronparecki.com/'],
        'name': ['Aaron Parecki'],
        'photo': ['https://aaronparecki.com/images/profile.jpg']
    },
    'value': 'https://aaronparecki.com/'
}

Find a post type

To find the post type associated with a web page, you can use the get_post_type function.

The get_post_type function function uses the following syntax:

indieweb_utils.get_post_type(h_entry: dict = {}, custom_properties: List[Tuple[str, str]] = []) str[source]

Return the type of a h-entry per the Post Type Discovery algorithm.

Parameters:
  • h_entry (dict) – The h-entry whose type to retrieve.

  • custom_properties (list[tuple[str, str]]) – The optional custom properties to use for the Post Type Discovery algorithm.

Returns:

The type of the h-entry.

Return type:

str

Here is an example of the function in action:

import indieweb_utils
import mf2py

url = "https://jamesg.blog/2022/01/28/integrated-indieweb-services/"

parsed_mf2 = mf2py.parse(url=url)

h_entry = [e for e in parsed_mf2["items"] if e["type"] == ["h-entry"]][0]

post_type = indieweb_utils.get_post_type(
    h_entry
)

print(post_type) # article
Raises:

PostTypeFormattingError – Raised when you specify a custom_properties tuple in the wrong format.

This function returns a single string with the post type of the specified web page.

See the Post Type Discovery specification for a full list of post types.

Custom Properties

The structure of the custom properties tuple is:

(attribute_to_look_for, value_to_return)

An example custom property value is:

custom_properties = [
    ("poke-of", "poke")
]

This function would look for a poke-of attribute on a web page and return the “poke” value.

By default, this function contains all of the attributes in the Post Type Discovery mechanism.

Custom properties are added to the end of the post type discovery list, just before the “article” property. All specification property types are checked before your custom attribute.

Find the original version of a post

To find the original version of a post per the Original Post Discovery algorithm, use this code:

indieweb_utils.discover_original_post(posse_permalink: str, soup: BeautifulSoup | None = None, html: str = '') str[source]

Find the original version of a post per the Original Post Discovery algorithm.

refs: https://indieweb.org/original-post-discovery#Algorithm

Parameters:

posse_permalink (str) – The permalink of the post.

Returns:

The original post permalink.

Return type:

str

Example:

import indieweb_utils

original_post_url = indieweb_utils.discover_original_post("https://example.com")

print(original_post_url)
Raises:

PostDiscoveryError – A candidate URL cannot be retrieved or when a specified post is not marked up with h-entry.

This function returns the URL of the original version of a post, if one is found. Otherwise, None is returned.

Find all feeds on a page

To discover the feeds on a page, use this function:

indieweb_utils.discover_web_page_feeds(url: str, user_mime_types: List[str] | None = None, html: str = '') List[FeedUrl][source]

Get all feeds on a web page.

Parameters:
  • url (str) – The URL of the page whose associated feeds you want to retrieve.

  • user_mime_types (Optional[List[str]]) – A list of mime types whose associated feeds you want to retrieve.

  • html (str) – A string with the HTML on a page.

Returns:

A list of FeedUrl objects.

Return type:

List[FeedUrl]

Example:

import indieweb_utils

url = "https://jamesg.blog/"

feeds = indieweb_utils.discover_web_page_feeds(url)

# print the url of all feeds to the console
for f in feeds:
    print(f.url)

This function returns a list with all feeds on a page.

Each feed is structured as a FeedUrl object. FeedUrl objects contain the following attributes:

class indieweb_utils.FeedUrl(url: str, mime_type: str, title: str)[source]

Get a Representative h-card

To find the h-card that is considered representative of a web resource per the Representative h-card Parsing Algorithm, use the following function:

indieweb_utils.get_representative_h_card(url: str, html: str = '', parsed_mf2: Parser | None = None) Dict[str, Any][source]

Get the representative h-card on a page per the Representative h-card Parsing algorithm.

refs: https://microformats.org/wiki/representative-h-card-parsing

Url:

The url to parse.

Returns:

The representative h-card.

Return type:

dict

Example:

import indieweb_utils

url = "https://jamesg.blog/"

h_card = indieweb_utils.get_representative_h_card(url)

print(h_card) # {'type': ['h-card'], 'properties': {...}}
Raises:

RepresentativeHCardParsingError – Representative h-card could not be parsed.

This function returns a dictionary with the h-card found on a web page.

Get a Page h-feed

The discover_h_feed() function implements the proposed microformats2 h-feed discovery algorithm.

This function looks for a h-feed on a given page. If one is not found, the function looks for a rel tag to a h-feed. If one is found, that document is parsed.

If a h-feed is found on the related document, the h-feed is returned.

This function returns a dictionary with the h-card found on a web page.

indieweb_utils.discover_h_feed(url: str, html: str = '') Dict[source]

Find the main h-feed that represents a web page as per the h-feed Discovery algorithm.

refs: https://microformats.org/wiki/h-feed#Discovery

Parameters:
  • url (str) – The URL of the page whose associated feeds you want to retrieve.

  • html (str) – The HTML of a page whose feeds you want to retrieve

Returns:

The h-feed data.

Return type:

dict

Example:

import indieweb_utils

url = "https://jamesg.blog/"

hfeed = indieweb_utils.discover_h_feed(url)

print(hfeed)

Infer the Name of a Page

To find the name of a page per the Page Name Discovery Algorithm, use this function:

indieweb_utils.get_page_name(url: str, html: str | None = None, soup: BeautifulSoup | None = None) str[source]

Retrieve the name of a page using the Page Name Discovery algorithm.

Refs:

https://indieweb.org/page-name-discovery

Parameters:
  • url (str) – The url of the page whose title you want to retrieve.

  • html (str) – The HTML of the page whose title you want to retrieve.

Returns:

A representative “name” for the page.

Return type:

str

Example:

import indieweb_utils

page_name = indieweb_utils.get_page_name("https://jamesg.blog")

print(page_name) # "Home | James' Coffee Blog"

This function searches:

  1. For a h-entry title. If one is found, it is returned;

  2. For a h-entry summary. If one is found, it is returned;

  3. For a HTML page <title> tag. If one is found, it is returned;

Otherwise, this function returns “Untitled”.

Get all URLs a Post Replies To

To find all of the URLs to which a reply post is replying, use this function:

indieweb_utils.get_reply_urls(url: str, html: str | None = None) List[str][source]

Retrieve a list of all of the URLs to which a given post is responding using a u-in-reply-to microformat.

Refs:

https://indieweb.org/in-reply-to#How_to_consume_in-reply-to

Parameters:
  • url (str) – The URL to get replies to.

  • html (str) – The HTML of the page whose replies you want to retrieve.

Returns:

A list of all of the URLs to which the given post responds.

Return type:

list

Example:

import indieweb_utils

reply_urls = indieweb_utils.get_reply_urls("https://aaronparecki.com/2022/10/10/17/")

print(reply_urls) # ["https://twitter.com/amandaljudkins/status/1579680989135384576?s=12"]