What's with the "Places > Wikipedia" pages on iNat?

graysquirrel · October 28, 2024, 5:10am

I’m not sure if this is a bug, or has some unknown purpose, but it got me curious.

I was searching for something else using the site:https://www.inaturalist.org/ filter on Google, and I couldn’t help but notice that my search results were filled with dozens of pages like this: https://www.inaturalist.org/places/wikipedia/Botany%20Bay
The text on all of them seems to be pulled from Wikipedia, as the URL suggests, but none of the embedded images load.

When I go to iNat and search in “Places” for Botany Bay, I get a totally different page, which also says there is no Wikipedia page available: https://www.inaturalist.org/places/botany-bay-nsw-australia

Anyone know what the purpose of these pages is?

EDIT: and some of the ‘places’ aren’t actually place pages: https://www.inaturalist.org/places/wikipedia/Capital

DianaStuder · October 28, 2024, 6:39am

For what it’s worth, your first link has photos for me today.

optilete · October 28, 2024, 6:59am

This one has data
https://www.inaturalist.org/places/botany-bay--2

benarmstrong · October 28, 2024, 7:55am

Not to take this too far off track: I think the fact that some people see images and some don’t is just browser-specific behaviour, and is unrelated to the issue of these pages coming up in search results. If these pages serve some purpose for the website, but otherwise aren’t useful for search engines like Google to index, the website could fix this by excluding them from indexing by bots, I think.
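One common way to exclude pages from indexing is to send an `X-Robots-Tag: noindex` response header for them. The following is a minimal sketch of that idea as Rack-style middleware in plain Ruby. The class name `NoindexFragments`, the path prefix constant, and the stand-in app are assumptions for illustration; I don't know how iNat's Rails app would actually wire this in.

```ruby
# Hypothetical Rack-style middleware (not iNaturalist code): adds an
# "X-Robots-Tag: noindex" header to responses for /places/wikipedia/* paths.
class NoindexFragments
  FRAGMENT_PREFIX = "/places/wikipedia/"

  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    if env["PATH_INFO"].to_s.start_with?(FRAGMENT_PREFIX)
      # merge returns a new hash, so the app's own headers hash is untouched
      headers = headers.merge("X-Robots-Tag" => "noindex")
    end
    [status, headers, body]
  end
end

# Stand-in for the real application, for demonstration only.
demo_app = ->(_env) { [200, { "Content-Type" => "text/html" }, ["<p>fragment</p>"]] }
stack = NoindexFragments.new(demo_app)

_status, fragment_headers, _body = stack.call("PATH_INFO" => "/places/wikipedia/Botany%20Bay")
_status, place_headers, _body = stack.call("PATH_INFO" => "/places/botany-bay--2")

puts fragment_headers.key?("X-Robots-Tag") # fragment page gets the header
puts place_headers.key?("X-Robots-Tag")    # regular place page does not
```

The fragment would still be fetchable by the place page itself; the header only asks search engines not to index it.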

But as for the images showing only in some browsers: if I use Firefox on Windows, I don’t see images. But if I press Ctrl+Shift+M to enter Responsive Design Mode and select an iPhone 11 Pro profile from the menu at the top, then I see images.

Ben

benarmstrong · October 28, 2024, 7:57am

That one, however, is not one of the links with /places/wikipedia in it, but just /places. The Google site: search will find both.

DianaStuder · October 28, 2024, 10:47am

It is strangely formatted - as if it is not intended to be public facing.
Or is it for Wiki editors?

benarmstrong · October 28, 2024, 11:34am

It is a fragment of HTML included in the place pages. When I visit https://www.inaturalist.org/places/botany-bay--2 and then open Developer Tools in Firefox (F12) to the Network tab and then reload the page, the first URL fetched for that page is the /places/wikipedia page we’ve been discussing:

If you click the “About Place” tab for the “botany-bay--2” iNat place, you will see the full content, formatted properly (except for the images, which aren’t loading properly; that’s still an issue for Firefox on desktop and probably all browsers except for mobile on small displays).

benarmstrong · October 28, 2024, 11:59am

[edit:] Ultimately, the problem with the images seems to be how the Wikipedia source is imported. I don’t know how iNat does this, but I think the fix might involve changing some of iNat’s Wikipedia import code for places.

Opening the iNat place page in Developer Tools and clicking the “About Place” tab allows me to debug further …

See all of the “Mixed Block” below in the Network pane?

This is because the src attribute of each broken image starts with http://upload.wikimedia.org, which does not match the https: scheme of the iNat page. The browser treats the image as insecure mixed content and blocks it. If we right-click the broken image and select “Inspect” to open it in the Inspector, we can find the offending URL and edit it locally (note: this only has an effect in the local browser session, not on the upstream websites).

With the URL corrected, the image now embeds successfully on the iNat site.

On the other hand, the srcset attribute has //upload.wikimedia.org (note that neither http: nor https: is present in the URL). Any URL that starts with // uses the same protocol as the web page that includes it (so, https: when included in an https: page). The srcset attribute is used to provide responsive images, and its candidates here carry pixel-density descriptors (1.5x, 2x). That probably explains why the images show correctly on mobile: on those displays the browser picks a srcset candidate instead of the broken src.
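To illustrate how a protocol-relative URL inherits its scheme from the embedding page, here is a small sketch using Ruby's standard `URI.join`. The helper name `resolve_against_page` and the shortened image path are made up for the example.

```ruby
require "uri"

# Resolve a (possibly protocol-relative) reference against the URL of the
# page that embeds it, following the usual relative-reference rules.
def resolve_against_page(page_url, reference)
  URI.join(page_url, reference).to_s
end

# Hypothetical, shortened image path for illustration.
image = "//upload.wikimedia.org/example.jpg"

# Embedded in an https: page, the reference resolves to https:
puts resolve_against_page("https://www.inaturalist.org/places/botany-bay--2", image)

# Embedded in an http: page, the same reference resolves to http:
puts resolve_against_page("http://example.org/some/page", image)
```

So a // URL left alone would be fetched over https: from the iNat page, with no mixed-content problem.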

I originally thought this could be fixed by editing the Wikipedia source, but that just has:

{{Infobox body of water
| name               = Botany Bay
| native_name        = 
| native_name_lang   = 
| other_name         = Kamay/Gamay, Sting Ray Harbour<ref>{{Cite web|url=https://www.nma.gov.au/exhibitions/endeavour-voyage/kamay-botany-bay/settling-on-a-name|title= Settling on a name|website=[[National Museum of Australia]]|accessdate=26 March 2023}}</ref>
<!--    Images     -->
| image              = Sydney from Botany Bay looking north (aerial).jpg
| alt                = 
| caption            = Aerial photo of [[Sydney]] showing Botany Bay in the foreground. <br/>
The two protrusions into the bay are runways of [[Sydney Airport]].
…

The Infobox macro should be emitting the correct HTML, so I don’t know how iNat’s import gets it wrong. If I simply use curl with the URL for the page, I see links to images like:

<img src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Sydney_from_Botany_Bay_looking_north_%28aerial%29.jpg/264px-Sydney_from_Botany_Bay_looking_north_%28aerial%29.jpg"
decoding="async" width="264" height="198" class="mw-file-element" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Sydney_from_Botany_Bay_looking_north_%28aerial%29.jpg/396px-Sydney_from_Botany_Bay_looking_north_%28aerial%29.jpg 1.5x,
//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Sydney_from_Botany_Bay_looking_north_%28aerial%29.jpg/528px-Sydney_from_Botany_Bay_looking_north_%28aerial%29.jpg 2x"
data-file-width="1600" data-file-height="1200" />

That looks correct to me! I wonder if iNat uses a Wikipedia API endpoint that outputs the content in a way that breaks these links?

benarmstrong · October 28, 2024, 2:15pm

I have written a lot on this post, so I want to sum up briefly with a couple of requests to the iNat devs [edit] (see below: now reported separately in Bug Reports):

1. Please see if HTML fragments served from the website, like /places/wikipedia/*, can be excluded from indexing by search engine bots. That seems like a straightforward thing to do.
2. Please review how the Rails app is handling results from querying Wikipedia. Substituting http:// for // seems questionable here:

https://github.com/inaturalist/inaturalist/blob/6352bbb3d5ff59b9c2dfad8c5556426a183f4e8c/app/controllers/shared/wikipedia_module.rb#L21-L22

  module WikipediaModule
    def wikipedia
      @title ||= params[:id]
      coder = HTMLEntities.new
      w = @wikipedia = WikipediaService.new
      @decoded = ""
      begin
        query_results = w.query(
          titles: @title,
          redirects: "",
          prop: "revisions",
          rvprop: "content"
        )
        raw = query_results.blank? ? nil : query_results.at( "page" )
        unless raw.blank? || raw["missing"]
          parsed = w.parse( page: raw["title"] )&.at( "text" )&.try( :inner_text )&.to_s
          @decoded = coder.decode( parsed )
          @decoded.gsub!( 'href="//', 'href="http://' )
          @decoded.gsub!( 'src="//', 'src="http://' )
          @decoded.gsub!( 'href="/', "href=\"#{w.base_url}/" )
          @decoded.gsub!( 'src="/', "src=\"#{w.base_url}/" )
          filter_wikipedia_content
          …

I didn’t do a very deep read of the code so I don’t know if this is the culprit or not, but it looks suspicious enough to warrant having a look.
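If that substitution does turn out to be the culprit, one possible change is to emit https:// instead of http://. This is a minimal sketch, not the actual iNaturalist code: the helper name `absolutize_wikipedia_urls`, the sample fragment, and the base URL `https://en.wikipedia.org` are all assumptions for illustration.

```ruby
# Hypothetical helper (not iNaturalist code): make the URLs in a Wikipedia
# HTML fragment absolute, using https for protocol-relative references.
def absolutize_wikipedia_urls(html, base_url)
  html
    .gsub('href="//', 'href="https://')     # protocol-relative links
    .gsub('src="//', 'src="https://')       # protocol-relative images
    .gsub('href="/', "href=\"#{base_url}/") # site-relative links
    .gsub('src="/', "src=\"#{base_url}/")   # site-relative images
end

# Made-up fragment with one site-relative link and one protocol-relative image.
fragment = '<a href="/wiki/Sydney"><img src="//upload.wikimedia.org/example.jpg"></a>'

puts absolutize_wikipedia_urls(fragment, "https://en.wikipedia.org")
```

The order matters: the // patterns are rewritten first, so the single-slash patterns no longer match them. Like the original, this leaves srcset untouched.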

[edit] ~~I’d be happy to file bug reports either in Github or in Bug Reports on the forum as needed. Just let me know which is preferred.~~ I have now filed them in Bug Reports

bazwal · October 28, 2024, 10:26pm

I don’t think this is entirely right. I am using Firefox 131.0.3 on Linux and I don’t have any problems. On my desktop system, the Network Dev Tool shows all the http image links as being resolved to https. It seems just as likely that the image blocking you are seeing is platform-specific and/or due to differences in security settings, or maybe even due to regional differences (I’m from the UK). What are your Firefox DNS over HTTPS and HTTPS-Only Mode settings? If you click on one of the blocked images in the Network Dev Tool, what do you see in the response headers?

Note that the iNaturalist page given by the OP is essentially the same as this one: https://en.wikipedia.org/w/index.php?action=render&title=Botany%20Bay. These pages are always delivered without CSS, which explains the formatting differences (i.e. it’s up to the client to supply the CSS).

It’s also worth noting what Wikipedia has to say about its use of Protocol-relative URLs, particularly with regard to potential breakage of http links since the site became 100% HTTPS. The related technical discussion here is similarly informative. This seems relevant to how iNaturalist is currently handling the data it gets via the Wikipedia APIs.

benarmstrong · October 28, 2024, 10:47pm

Enabling HTTPS-only mode (which I have disabled) is a workaround, yes. I tested that just now and can confirm that images in the Botany Bay iNat place “About Botany Bay” tab are shown when that option is enabled.

What I meant by “browser-specific” was simply that features of different browsers (like HTTPS-only, which we can’t assume everyone uses - but also responsive images, covered in my later response on this thread) result in different experiences for different users. That’s a distraction from the real issue, though, which is that the iNat site is generating an unacceptable mix of https: content and http: embedded content. This appears to be due to an inexplicable substitution of src="// with src="http:// by the wikipedia module in the inaturalist code. That needs to be fixed in the iNat source, as not all browsers will automatically fix the issue (or dodge it entirely, in the case of responsive images on small displays, since that substitution does not occur for the srcset attribute).
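One rough way to check a fragment for this kind of mixed content is to scan it for src attributes that use plain http://. This is a sketch only: the helper name `insecure_image_sources` and the sample markup are made up, and a regular expression is no substitute for a real HTML parser.

```ruby
# Hypothetical audit helper: list src attribute values that use plain http://,
# which an https: page would load as mixed content.
def insecure_image_sources(html)
  html.scan(/\ssrc="(http:\/\/[^"]*)"/).flatten
end

# Made-up fragment: one http: src (with a protocol-relative srcset) and one
# https: src.
fragment = <<~HTML
  <img src="http://upload.wikimedia.org/example-a.jpg"
       srcset="//upload.wikimedia.org/example-a-2x.jpg 2x" />
  <img src="https://upload.wikimedia.org/example-b.jpg" />
HTML

# Prints only the http: src value; srcset and https: sources are ignored.
puts insecure_image_sources(fragment)
```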

system · December 27, 2024, 10:47pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.