Look inside the open source ‘Information Laundromat’ tool for examining website content and metadata

I like to investigate websites. I wrote a chapter about it for the latest edition of the Verification Handbook and am always looking for new tools and methods to connect sites to each other, identify owners, and analyze site content, infrastructure, and behavior.

The Information Laundromat is one of the newest and most exciting free website analysis tools I’ve come across. Developed by the German Marshall Fund’s Alliance for Securing Democracy (ASD), it can analyze both content and metadata. ASD, along with researchers from the University of Amsterdam and the Institute for Strategic Dialogue, used it in their recent report, “The Russian Propaganda Nesting Doll: How RT Is Layered Throughout the Digital Information Ecosystem.”

The Information Laundromat can analyze two elements: the content posted on a site and the metadata used to build and run it. Here’s a quick rundown of how it works, based on my initial testing and an interview with Peter Benzoni, the tool’s developer.

Peter told me that the Information Laundromat works best for lead generation: “It shouldn’t automate your investigation.” The Information Laundromat is open source and available on ASD’s GitHub account.

Content vs. metadata similarity in the Information Laundromat site analysis tool. Image: Screenshot, Digital Investigations

This tool analyzes a link, title, or piece of text to identify other web properties with similar or identical content. It was useful in the ASD investigation because the researchers wanted to see which sites were consistently copying from Russia Today (RT), the Russian state broadcaster. They were able to identify sites that consistently reprinted RT’s content and, in doing so, helped launder its narratives around the web, according to the research.

How it works

  • Enter the URL, title, or piece of text you want to check.
  • The system queries search engines, the Copyscape plagiarism checker, and the GDELT database to analyze and rank the similarity between the source content and other sites.
  • A results page sorts sites by the percentage of content similar to the original source. (A rough sketch of this ranking idea follows this list.)
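
To make the ranking step concrete, here is a minimal Python sketch that scores a few hypothetical candidate pages against a source text and sorts them by similarity. This is not the Information Laundromat’s actual pipeline, which relies on search engines, Copyscape, and GDELT; the URLs and texts below are invented for illustration.

```python
# Minimal sketch: rank hypothetical candidate pages by textual similarity
# to a source article. Illustrative only, not the Information Laundromat's
# actual method.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two blocks of text."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100, 1)


source_text = "Full text of the article you are checking..."  # hypothetical input
candidates = {  # hypothetical pages found via search
    "https://example-reprint.com/story": "Full text of the article you are checking...",
    "https://unrelated-site.example/post": "A completely different article about something else.",
}

ranked = sorted(
    ((url, similarity(source_text, text)) for url, text in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for url, score in ranked:
    print(f"{score:5.1f}%  {url}")
```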

I ran an example search with a URL that I knew was a near carbon copy of a news article published elsewhere. The Information Laundromat correctly identified the original source of the text, giving it a similarity score of 97%.

Content similarity scores in the Information Laundromat website analysis tool. Image: Screenshot, Digital Investigations

The tool’s documentation also highlights what it does not do:

Content similarity search tries to find similar articles or text on the open web. It does not provide evidence of the origin of that text or any relationship between two entities posting two similar texts. Determining the provenance of a particular text is beyond the scope of this tool.

If you get a lot of results, Peter suggested “downloading everything into Excel and looking through it a bit yourself with a pivot table.”

According to Peter, sites with a similarity rating of 70% or higher are likely to be of most interest. The tool also has a batch upload option if you register on the site.
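
If you would rather script that triage than do it in Excel, the same idea works in Python with pandas. The column names and rows below are hypothetical, not the tool’s actual export schema; the snippet simply applies the 70% threshold Peter mentioned and summarizes matches per domain.

```python
# Hypothetical triage of exported similarity results with pandas:
# keep matches at or above 70% and pivot by matched domain.
import pandas as pd

results = pd.DataFrame({  # invented example rows, not real export data
    "matched_domain": ["copycat.example", "copycat.example", "other.example"],
    "source_url": ["rt.example/a", "rt.example/b", "rt.example/a"],
    "similarity": [97.0, 88.5, 42.0],
})

high_interest = results[results["similarity"] >= 70]
pivot = high_interest.pivot_table(
    index="matched_domain", values="similarity", aggfunc=["count", "mean"]
)
print(pivot)
```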

Metadata similarity (urlscan) in the Information Laundromat website analysis tool. Image: Screenshot, Digital Investigations

The Information Laundromat metadata similarity tool works best when you have a set of sites you want to analyze. It is possible, but less efficient, to use it to analyze a single site.

How it works

  • Enter a set of domains that you want to scan for shared connections.
  • The tool scans each domain, including infrastructure such as IP addresses and source code, to extract unique indicators and determine whether there is overlap between domains. It flags direct matches for IP addresses and also highlights whether sites are hosted in the same IP range, which is a weaker connection but still worth noting. In addition to looking for unique advertising and analytics codes, the tool scans a site’s CSS file to look for similarities. Peter told me that CSS classes “must be more than 90% similar” for the tool to flag them as notable. (View the tool’s full list of website indicators here; a simplified sketch of the overlap check follows this list.)
  • The metadata page sorts the results into two sections.
    • The first table lists the indicators present on each site.
    • The second table identifies the indicators shared between sites.
  • The tool also sorts the results in each table based on the relative strength of each indicator. (I explain more in the final section of this post.)
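
As a rough illustration of the overlap idea, and not the Information Laundromat’s actual implementation, the sketch below pulls two common indicator types (Google Analytics and Google AdSense IDs) out of hypothetical page HTML and flags any values shared across domains.

```python
# Simplified sketch: extract analytics/advertising IDs from each site's HTML
# and flag IDs shared between domains. The HTML snippets are hypothetical and
# the patterns cover only two indicator types.
import re
from itertools import combinations

INDICATOR_PATTERNS = {
    "google_analytics": re.compile(r"\b(UA-\d{4,10}-\d{1,4}|G-[A-Z0-9]{6,12})\b"),
    "google_adsense": re.compile(r"\b(ca-pub-\d{10,20})\b"),
}


def extract_indicators(html: str) -> set:
    """Return (indicator_type, value) pairs found in a page's HTML."""
    found = set()
    for name, pattern in INDICATOR_PATTERNS.items():
        for value in pattern.findall(html):
            found.add((name, value))
    return found


pages = {  # hypothetical fetched HTML for each domain
    "site-a.example": '<script>gtag("config", "G-ABC123XYZ9");</script>',
    "site-b.example": '<script>gtag("config", "G-ABC123XYZ9");</script>',
    "site-c.example": '<ins data-ad-client="ca-pub-1234567890123456"></ins>',
}

indicators = {domain: extract_indicators(html) for domain, html in pages.items()}

for site_a, site_b in combinations(indicators, 2):
    shared = indicators[site_a] & indicators[site_b]
    if shared:
        print(f"{site_a} and {site_b} share: {shared}")
```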

“The idea is to try to capture everything you can about sites that you could use to build links between them,” Peter told me.

If you are not familiar with the method of connecting sites to each other through analytics and ad codes, you can read this basic guide and this recent post of mine (read the guide first!). The Information Laundromat’s metadata module is most useful if you are familiar with website infrastructure such as IP addresses and understand how to link websites together using these indicators. The risk in using this tool comes from not understanding the relative strengths and weaknesses of each indicator and connection. (More on that below.)

Peter said the metadata analysis tool is a great starting point for finding connections between a set of sites.

“If you have a set of sites and you want to understand the potential overlap, then this is a good way to get a quick snapshot of them, as opposed to running them manually in a bunch of other tools,” he said.

I agree that it is potentially a good starting point if you have a set of sites that you think might have connections. The Information Laundromat will provide a useful overview of potential connections. Then you can take them and do a deeper dive using tools like DNSlytics, BuiltWith, SpyOnWeb, and your preferred passive DNS platform.

Although the tool works best with a group of domains, you can run a metadata search with a single URL. This is useful if you want the system to extract indicators such as analytics codes so you can easily search for them in places like DNSlytics. You can also see if the URL shares indicators with the set of approximately 10,000 domains stored in the Information Laundromat’s database. The tool’s About page lists the sources.

In particular, Peter said that as of now, the tool does not add user-entered domains to the database. So, if you’re searching using a set of domains you consider sensitive, you can rest easy knowing that the tool won’t add your sites to the Information Laundromat dataset.

As mentioned above, it is essential to understand the relative strengths and weaknesses of the website indicators highlighted by the tool. Otherwise, you risk exaggerating the connections between sites. Fortunately, the Information Laundromat documentation provides a useful breakdown of the indicators.

For example, it’s a weak connection if multiple sites use WordPress as their content management system. Hundreds of millions of websites use WordPress; it is not a useful signal in itself to link sites to each other. But the connection between sites is much stronger if they all use the same Google AdSense code.

Ideally, you want to identify several technical indicators that connect a set of sites and combine them with other information to properly assess the strength of the connections.

To help with the analysis, the Information Laundromat sorts the indicators into three levels. The results page helpfully uses color coding to guide you to strong, moderate, or weak indicators. You still need to do your own analysis, but it’s a useful starting point.

An example metadata search run using RT-related domains. Image: Screenshot, Digital Investigations

Here are the three levels of indicators from the Information Laundromat documentation. (A toy sketch of how these levels might be weighted follows the list.)

    • Level 1: These “are usually unique or highly indicative of a website’s provenance” and include “unique IDs for verification purposes and web services such as Google, Yandex, etc., as well as website metadata such as WHOIS information and certificates.”
    • Level 2: Such indicators “provide a moderate level of certainty about the provenance of a website”. They “provide valuable context” and include “IPs on the same subnet, matching meta tags, and common elements in standard and custom response headers.”
    • Level 3: They suggest using these indicators in combination with higher-level indicators. Level 3 includes “shared CSS classes, UUIDs, and content management systems.”
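
To make the difference in indicator strength concrete, here is a toy Python sketch that maps some indicator types to the three levels above and reports the strongest level shared between two sites. The level assignments loosely follow the documentation excerpts; the names and the scoring itself are my own illustration, not part of the tool.

```python
# Toy illustration of indicator levels: a shared AdSense ID (Level 1) is a
# far stronger signal than a shared CMS (Level 3). Mapping and names are
# illustrative, not the Information Laundromat's internal scheme.
INDICATOR_LEVEL = {
    "google_adsense_id": 1,          # unique verification/service IDs
    "google_analytics_id": 1,
    "whois_registrant": 1,
    "ip_same_subnet": 2,             # moderate certainty
    "matching_meta_tags": 2,
    "shared_css_classes": 3,         # weak on their own
    "content_management_system": 3,
}


def strongest_level(shared_indicators: list) -> int:
    """Return the strongest (lowest-numbered) level among shared indicators."""
    return min(INDICATOR_LEVEL.get(name, 3) for name in shared_indicators)


# Two sites sharing only WordPress is a weak signal...
print(strongest_level(["content_management_system"]))                       # -> 3
# ...but also sharing an AdSense ID makes the connection much stronger.
print(strongest_level(["content_management_system", "google_adsense_id"]))  # -> 1
```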

Editor’s Note: This post was originally published on ProPublica reporter Craig Silverman’s Digital Investigations Substack and is reprinted here with permission.


Craig Silverman is a national reporter for ProPublica, covering voting, platforms, misinformation and online manipulation. He was previously media editor at BuzzFeed News, where he pioneered coverage of digital disinformation.