Saturday, November 18, 2017

google chrome - Why do images from some Tumblr pages not load, but using wget on them works?



Helping a friend out with their Internet connection because “some pages won’t load”, I noticed that the problem was that the images in certain blogs' image posts weren’t loading in the browser. I found it weird for the following reasons:





  1. Only images that are part of the post won’t load. User avatars, banners, headers, various theme and/or page-related images still appear.

  2. Happens with any browser on the computer (Tested on Firefox and Chrome/ium both with and without ad/script blockers).

  3. Using wget on the images' direct links works.

  4. This does not apply to all Tumblr pages. Most load properly, but a list of the pages whose posts don’t load images shows that they’re mostly from the same bunch of users.

  5. The problem seems to be blog-specific in the sense that if a certain blog's image post doesn't load in the browser, other blogs (unaffected or not) that reblogged the same post won't load the image in the browser either. Conversely, if an affected blog reblogs from an unaffected one, the image loads fine.

  6. The images are from user-created Tumblr posts where the user uploads an image to post and are hosted by Tumblr. For example (this example is not one of the affected blogs), in this image post (randomly selected), this would be the direct link to the image in the post. Image posts automatically make the images a link to another page in Tumblr using a (usually) larger version of the image used in the post that is closer to the size of what the user uploaded for the post.



What can possibly be the reason for this happening? The part that really gets me is the fact that wget works, so I think I can assume that it’s not a problem with the network connection.




Update:



Here is an example of a reblogged post that fails to load on the browsers. The main blog has other image posts that load properly. This is the direct link to the image in the post and here is the one for the bigger version (both don't load here). wget works for both, but upon going to any direct link with Firefox, this error appears:



This XML file does not appear to have any style information associated with it. The document tree is shown below.


<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>A626307DF577B411</RequestId>
<HostId>J9GxX1HY9vX3ElWjYf7M48ByvKXLRIwRBJ2al2voS3J/C+WhILWHyd3crFhhNtkXuvG0zaxBTxw=</HostId>
</Error>



The RequestId and HostId change every time. My friend and I are located in the Philippines.
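For anyone collecting several of these error pages to compare, here is a minimal Python sketch that pulls the fields out of one. It assumes the body is a small XML document with Code, Message, RequestId and HostId elements; those element names are an assumption based on the fields mentioned above, and `parse_error` is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the error body. The element names are an assumption;
# the values are the ones shown in the error above.
SAMPLE_ERROR = """<Error>
  <Code>AccessDenied</Code>
  <Message>Access Denied</Message>
  <RequestId>A626307DF577B411</RequestId>
  <HostId>J9GxX1HY9vX3ElWjYf7M48ByvKXLRIwRBJ2al2voS3J/C+WhILWHyd3crFhhNtkXuvG0zaxBTxw=</HostId>
</Error>"""


def parse_error(body):
    """Return the child elements of an XML error document as a dict."""
    root = ET.fromstring(body)
    return {child.tag: (child.text or "").strip() for child in root}


info = parse_error(SAMPLE_ERROR)
print(info["Code"], info["RequestId"])
```

Saving the Code and RequestId from each failed attempt makes it easier to show support staff that the request ID differs on every try while the error code stays the same.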



Update [2014/03/08]



After further tests, and some back-and-forth by email with Tumblr support, wget has also stopped working on some occasions (403 errors on direct links).




Update [2014/03/09]



Turning off the Tumblr rules for HTTPS-Everywhere seems to sometimes fix the problem.






Note:




  • In the example for #6, direct links both point to the same image. Usually, though, the one used in the image post (as compared to the zoomable image page) uses a smaller version of the image to fit the theme of the page. The example uses a theme made for larger screens so it does not need the smaller version.



Answer



UPDATE: It seems the core issue with images not loading stemmed from the way the EFF’s HTTPS Everywhere plugin/extension handled some Tumblr URLs. The developers were notified and a fix appears to be in place. This answer basically breaks down the detective work done to uncover the issue as outlined by the initial question and could prove useful for further debugging/diagnosis if a similar issue appears in the future.






EDIT: The larger content about image leeching seems invalid, so I will add a new idea at the top and leave the image leeching info at the bottom just in case it is useful to someone.



Amazon CloudFront CDN Ideas




Okay, using the URLs you have provided—as well as some of my real world experience with Amazon CloudFront CDN setups—I think I discovered something. It seems like Tumblr’s Amazon CloudFront CDN config is choking for some reason. Here is why I think that is the case.



Let’s take this example URL:



http://36.media.tumblr.com/d685b02fdf2d3f167c22d9a97e27e87a/tumblr_nfpq5qPZ4v1tognpro1_1280.png


Now let’s run curl -I to get header information on that file:




curl -I http://36.media.tumblr.com/d685b02fdf2d3f167c22d9a97e27e87a/tumblr_nfpq5qPZ4v1tognpro1_1280.png


The output for that would be something like this:



HTTP/1.1 200 OK
Content-Type: image/png
Content-Length: 782141
Connection: keep-alive
Accept-Ranges: bytes

Cache-Control: max-age=1209600
Date: Thu, 05 Mar 2015 02:15:44 GMT
Server: nginx
X-Cache: Miss from cloudfront
Via: 1.1 7e54fc06cd70e4752fe050bbe5c130be.cloudfront.net (CloudFront)
X-Amz-Cf-Id: QyIUyzfaJJN3PU_xWkW0P-D2kjg_1cVenKzFAoY2PubgZQlBHWorZQ==


Now the things to pay attention to here are the Date header (the date and time of the file on the CloudFront endpoint) and the X-Cache header (Amazon's content delivery status). Typical behavior on Amazon CloudFront is that the first access reports “Miss from cloudfront”, and if you run another curl -I right afterwards it should report “Hit from cloudfront”.
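If you want to log those two headers over many attempts rather than read them by eye, a small parser helps. This is a minimal Python sketch that takes the text printed by curl -I and returns the status line plus a dictionary of headers; `parse_headers` is an illustrative name, and the sample text is the header output shown above.

```python
def parse_headers(raw):
    """Split `curl -I` output into (status_line, {lowercased-name: value})."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    status_line, headers = lines[0], {}
    for line in lines[1:]:
        # Split on the first colon only: Date values contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


SAMPLE = """HTTP/1.1 200 OK
Content-Type: image/png
Content-Length: 782141
Connection: keep-alive
Accept-Ranges: bytes
Cache-Control: max-age=1209600
Date: Thu, 05 Mar 2015 02:15:44 GMT
Server: nginx
X-Cache: Miss from cloudfront
Via: 1.1 7e54fc06cd70e4752fe050bbe5c130be.cloudfront.net (CloudFront)
X-Amz-Cf-Id: QyIUyzfaJJN3PU_xWkW0P-D2kjg_1cVenKzFAoY2PubgZQlBHWorZQ==
"""

status, headers = parse_headers(SAMPLE)
print(status)
print(headers["date"], "|", headers["x-cache"])
```

Header names are lowercased because HTTP header names are case-insensitive, so the lookup works no matter how a given server capitalizes them.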




But that’s not what I saw just now. Here is a breakdown of the Date and X-Cache status of a bunch of accesses I made:




  • Date: Thu, 05 Mar 2015 02:19:37 GMT = X-Cache: Miss from cloudfront

  • Date: Thu, 05 Mar 2015 02:19:39 GMT = X-Cache: Miss from cloudfront

  • Date: Thu, 05 Mar 2015 02:19:44 GMT = X-Cache: Miss from cloudfront

  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Miss from cloudfront

  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Hit from cloudfront

  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Hit from cloudfront

  • Date: Thu, 05 Mar 2015 02:19:50 GMT = X-Cache: Hit from cloudfront




The reason there are multiple entries near the end with the exact same date, all of them Hit from cloudfront, is that this is what happens on a CDN: if the endpoint of the CDN has the file, then Date correlates to the actual creation/modification date of the copy that endpoint has.



You notice the first four accesses are seconds apart, with different dates/times, and all of them are Miss from cloudfront, right? That means the CDN endpoint is just echoing back that there was an attempt to access that file at those times and all attempts were misses.
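The pattern in that list can be checked mechanically. The Python sketch below takes the seven Date / X-Cache pairs listed above and counts misses and hits, how many distinct Date values each group has, and whether the Date on the hits matches one of the misses. `summarize` is an illustrative helper; the interpretation that a cached copy keeps the Date of the response that filled the cache is standard HTTP cache behavior, not something confirmed by Tumblr or Amazon for this case.

```python
from email.utils import parsedate_to_datetime

# The observations listed above, in order.
OBSERVATIONS = [
    ("Thu, 05 Mar 2015 02:19:37 GMT", "Miss from cloudfront"),
    ("Thu, 05 Mar 2015 02:19:39 GMT", "Miss from cloudfront"),
    ("Thu, 05 Mar 2015 02:19:44 GMT", "Miss from cloudfront"),
    ("Thu, 05 Mar 2015 02:19:50 GMT", "Miss from cloudfront"),
    ("Thu, 05 Mar 2015 02:19:50 GMT", "Hit from cloudfront"),
    ("Thu, 05 Mar 2015 02:19:50 GMT", "Hit from cloudfront"),
    ("Thu, 05 Mar 2015 02:19:50 GMT", "Hit from cloudfront"),
]


def summarize(observations):
    """Count misses/hits and compare the Date values seen in each group."""
    misses, hits = [], []
    for date_text, x_cache in observations:
        when = parsedate_to_datetime(date_text)
        if x_cache.lower().startswith("miss"):
            misses.append(when)
        else:
            hits.append(when)
    return {
        "misses": len(misses),
        "hits": len(hits),
        "distinct_miss_dates": len(set(misses)),
        "distinct_hit_dates": len(set(hits)),
        # Seconds between the first and the last miss.
        "miss_span_seconds": (max(misses) - min(misses)).total_seconds() if misses else None,
        # True when the hits carry the Date of one of the misses.
        "hit_reuses_miss_date": bool(set(hits) & set(misses)),
    }


summary = summarize(OBSERVATIONS)
print(summary)
```

On this data the misses each carry a fresh Date, while all hits share the Date of the last miss, which fits the idea that the endpoint only started serving a stored copy after the fourth request.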



So my armchair assessment of this is that Tumblr’s systems are not keeping up with the Amazon CloudFront CDN or the Amazon CloudFront CDN is not keeping up with Tumblr. But in some way, things are amiss on their server side. And since this is a CDN, someone accessing the files in one location might not notice an issue while someone else in another location would have issues viewing the image.



Which is all to say, I don’t think this can easily be cleared up on the client side.







EDIT: So the original poster added some new URLs, and this still points to a server-side issue, but I just wanted to post the details for the record.



EdgeCast & Highwinds CDN Ideas



Here are more details, based on the blog post that is being used as an example:



http://claystorks.tumblr.com/post/112741831192/soulmister-claystorks-windspeare-explain



And these image URLs are provided as examples of URLs in that post:



https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_500.png

https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_1280.png


And those two image URLs do indeed fail. But from my side (looking at the original source code of the blog post from Brooklyn, New York, USA) I am not seeing those EdgeCast (gs1.wac.edgecastcdn.net) URLs. Rather, these are the URLs I am seeing:




http://41.media.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_500.png

http://41.media.tumblr.com/76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_1280.png


So my first thought is: why is the original poster seeing those EdgeCast (gs1.wac.edgecastcdn.net) URLs? But then if I do a traceroute to 41.media.tumblr.com, I see that it is a server managed by Highwinds (!?!?). In contrast, the initial URLs passed on by the original poster use the 36.media.tumblr.com hostname, and you can see they are managed by Amazon CloudFront CDN servers.



Which is all to say, as I said before, that all of this seems to be a server-side issue with Tumblr and their CDN management. From my side, in Brooklyn, New York, USA, I am clearly seeing content being delivered as expected from Highwinds CDN servers as well as Amazon CloudFront CDN servers. Where these EdgeCast URLs are coming from, or how/why they are then failing, is out of anyone’s control on the client side. This would definitely be something to contact Tumblr tech staff about, because there is no way a desktop end-user could resolve this.
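One way to confirm that the failing and working URLs refer to the same file, and differ only in scheme and host, is to compare their trailing path segments. This is a minimal Python sketch using the two _500.png URLs quoted above; `asset_key` and `describe` are illustrative helper names, and treating the last two path segments as the identity of the image is an assumption based on how these particular URLs look.

```python
from urllib.parse import urlsplit

EDGECAST_URL = (
    "https://gs1.wac.edgecastcdn.net/8019B6/data.tumblr.com/"
    "76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_500.png"
)
MEDIA_URL = (
    "http://41.media.tumblr.com/"
    "76493f424ebb3b62d6de43e53643180a/tumblr_nkps82DdCh1sjn35qo1_500.png"
)


def asset_key(url):
    """Return the last two path segments (hash directory plus file name)."""
    segments = [part for part in urlsplit(url).path.split("/") if part]
    return "/".join(segments[-2:])


def describe(url):
    """Break a URL into the pieces that matter for this comparison."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "asset": asset_key(url)}


edgecast, media = describe(EDGECAST_URL), describe(MEDIA_URL)
print(edgecast)
print(media)
print("same asset:", edgecast["asset"] == media["asset"])
```

If the asset keys match while the scheme and host differ, something between the page source and the request is rewriting the URL, which is worth checking against anything installed in the browser that rewrites URLs.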







Image Leeching Ideas



Might not be relevant anymore, but here for reference.



Your stating this gives me a clue:




Using wget on the images' direct links works.





Many sites have rules in place (usually set via Apache) that prevent image leeching. More details on how those rules work are provided here, and they are summarized as follows:




Using .htaccess, you can disallow hot linking on your server, so those
attempting to link to an image or CSS file on your site, for example,
is either blocked (failed request, such as a broken image) or served a
different content (ie: an image of an angry man).





Your description, and the fact that you can access the images via wget, leads me to believe that the images you are having issues with are not hosted on Tumblr by users, but are rather images that are placed on a Tumblr blog but actually hosted on another site.



When standard image leeching procedures are put in place, viewing an embedded image on one site that is hosted on another site—which blocks leeching—would result in a broken image link or perhaps a “Stop Leeching!” image being returned. This is because basic anti-leeching rules—such as those in that example page—crosscheck image referrers to make sure the page requesting the image matches the domain hosting the image.



When you access the image via wget, you are accessing the image directly, so image leeching rules would not kick in. Thus you can get the image via wget but not when it is embedded in another page.
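The referrer check described above can be sketched in a few lines of Python. This is not Tumblr's logic or any specific site's rule set; `is_blocked` and the example.com hostnames are hypothetical, and the sketch assumes the common setup where requests with no Referer header are allowed and only requests referred from a foreign domain are refused.

```python
from urllib.parse import urlsplit


def is_blocked(referer, image_host):
    """Hypothetical anti-leeching rule: refuse requests referred by other sites.

    referer    -- value of the Referer header, or None if the client sent none
    image_host -- domain that hosts the image (hypothetical, e.g. "example.com")
    """
    if not referer:
        # Direct request (wget sends no Referer by default): allowed.
        return False
    ref_host = urlsplit(referer).hostname or ""
    same_site = ref_host == image_host or ref_host.endswith("." + image_host)
    return not same_site


# A direct download, a page on the hosting site, and an embed on another site.
direct = is_blocked(None, "example.com")
own_page = is_blocked("http://www.example.com/gallery", "example.com")
embedded = is_blocked("http://someblog.tumblr.com/post/1", "example.com")
print(direct, own_page, embedded)
```

Under those assumptions only the third request is refused, which matches the symptom of an image that downloads fine directly but shows as broken when embedded in a page on another domain.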

