php web rss crawler?

I was wondering if anyone knew of some code that I could use to crawl a website looking for RSS feeds, kind of like how Firefox detects RSS feeds in a webpage.

2006-10-30 17:10:26 · 3 answers · asked by Brady 3 in Computers & Internet ➔ Programming & Design

I want to replicate what Firefox does on my website. Firefox uses the head tag, but I also believe it searches for .xml links and checks whether they're RSS.

2006-10-30 17:48:16 · update #1

There isn't a website that I've seen that does this; that's part of the reason I'm trying to create one. I'd be happy with just looking in the head tag for one, but I don't even know how to go about crawling pages in PHP.

My idea is to have the user enter the address of a site they are looking for RSS feeds on, say discovery.com. Then I crawl 3 levels or so looking for RSS feeds and provide a list for them, so they can add them to their custom home page at my website. (My site is an online RSS reader.)

http://www.globalnewsnow.info

Thanks for the response.

2006-11-01 07:45:32 · update #2

Do you know of any good and easy-to-use ones? I'm really lost on this; I've never used PHP to do this sort of thing. What is the easiest way to search the response for the RSS tag?

Thanks for sticking with me on this.

2006-11-01 14:42:08 · update #3

3 answers

Conceptually, the code is very simple:

1. Find all hyperlinks in the home page that point to pages in the same domain and put them into a queue.

2. For every link in the queue, send a HEAD request to its URL. If the "Content-Type:" header is present and is one of the content types used by RSS feeds (text/xml, application/xml, application/rss+xml, or similar), send a GET request and attempt to parse the body of the response as an RSS feed.

3. While trying URLs, keep adding to the queue all hyperlinks pointing to pages in the same domain found in the currently viewed page.
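
The three steps above can be sketched roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, and the two callables ($headType and $getLinks) are hypothetical hooks that you would implement yourself with cURL or PHP streams. The depth limit is an addition taken from the asker's "3 levels or so".

```php
<?php
// Content types the answer lists as typical for RSS feeds.
const FEED_TYPES = ['text/xml', 'application/xml', 'application/rss+xml'];

/** True if a Content-Type header value names one of the feed types. */
function isFeedContentType(string $contentType): bool
{
    // Drop parameters such as "; charset=utf-8" before comparing.
    $mime = strtolower(trim(explode(';', $contentType)[0]));
    return in_array($mime, FEED_TYPES, true);
}

/**
 * Breadth-first crawl limited to $maxDepth levels below the start URL.
 *
 * $headType(string $url): returns the Content-Type from a HEAD request, or null.
 * $getLinks(string $url): returns the same-domain links found in that page.
 * Both callables are hypothetical hooks, not part of any library.
 */
function findFeedCandidates(string $startUrl, int $maxDepth, callable $headType, callable $getLinks): array
{
    $queue = [[$startUrl, 0]];
    $seen = [$startUrl => true];   // avoids visiting the same URL twice
    $candidates = [];

    while ($queue) {
        [$url, $depth] = array_shift($queue);

        $type = $headType($url);
        if ($type !== null && isFeedContentType($type)) {
            // Step 2: a candidate; the caller would now GET it and try to parse it as RSS.
            $candidates[] = $url;
            continue;
        }
        if ($depth >= $maxDepth) {
            continue;
        }
        // Steps 1 and 3: keep adding same-domain links to the queue.
        foreach ($getLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = [$link, $depth + 1];
            }
        }
    }
    return $candidates;
}
```

A matching content type only makes a URL a candidate, since text/xml and application/xml are used for plenty of things besides feeds; the GET-and-parse step is still needed to confirm it.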

The problem with doing it in PHP is going to be the time limit. By default, PHP kills a script after 30 seconds of execution time (the max_execution_time setting). So you'll have to find a way to redirect the script to itself every now and then and store the queue between redirects.
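
One way to store the queue between redirects is to keep it in a small file and work through it in time-boxed slices. The sketch below assumes a JSON state file and a hypothetical $processUrl callback that handles one URL and returns any new URLs to enqueue; the function names and the file format are illustrative choices, not something the answer prescribes. It measures wall-clock time as a conservative stand-in for PHP's own limit.

```php
<?php
/** Load the saved queue, or start a new one with $startUrl if nothing usable is stored. */
function loadQueue(string $stateFile, string $startUrl): array
{
    if (is_file($stateFile)) {
        $data = json_decode((string) file_get_contents($stateFile), true);
        if (is_array($data)) {
            return $data;
        }
    }
    return [$startUrl];
}

/**
 * Process queued URLs until the queue is empty or $budgetSeconds have elapsed,
 * then write whatever is left back to $stateFile.
 * Returns true once the queue is empty, false if another slice is needed.
 *
 * $processUrl(string $url): string[] is a hypothetical callback.
 */
function crawlSlice(string $stateFile, string $startUrl, float $budgetSeconds, callable $processUrl): bool
{
    $queue = loadQueue($stateFile, $startUrl);
    $deadline = microtime(true) + $budgetSeconds;

    while ($queue && microtime(true) < $deadline) {
        $url = array_shift($queue);
        foreach ($processUrl($url) as $next) {
            $queue[] = $next;
        }
    }

    // Persist the remaining work so the next request can pick it up.
    file_put_contents($stateFile, json_encode(array_values($queue)), LOCK_EX);
    return count($queue) === 0;
}
```

When crawlSlice() returns false, the script could redirect to itself (for example with header('Location: ...')) and call it again. A set of already-visited URLs would need to be saved the same way to avoid loops; it is left out here to keep the sketch short.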

2006-11-06 05:24:25 · answer #1 · answered by NC 7

Firefox knows there is a feed from the <link> tag in the page's <head>; no crawling needed... Look at this (from this Yahoo Answers page):

<link rel="alternate" type="application/rss+xml" title="Yahoo! Answers: Answers and Comments for php web rss crawler?" href="/rss/question?qid=20061030221026AAor9NY" />

That says there is an RSS feed for the page.
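
A feed link like that can be pulled out of a downloaded page with PHP's DOM extension rather than by crawling. The following is a minimal sketch, assuming the page's HTML is already in a string; discoverFeedLinks() is a made-up name, and matching only on type="application/rss+xml" is an assumption you may want to widen.

```php
<?php
/**
 * Return title and href for every <link> element whose type attribute
 * marks it as an RSS feed.
 */
function discoverFeedLinks(string $html): array
{
    if (trim($html) === '') {
        return [];   // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feeds = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        $type = strtolower(trim($link->getAttribute('type')));
        $href = trim($link->getAttribute('href'));
        if ($type === 'application/rss+xml' && $href !== '') {
            $feeds[] = ['title' => $link->getAttribute('title'), 'href' => $href];
        }
    }
    return $feeds;
}
```

The href is often relative, as in the example from this page, so it still has to be resolved against the URL of the page it came from before it can be fetched.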

Update:

There is a rarely used variant of the <a> tag that carries a rel= attribute, which could be used for multiple RSS links, and rel=alternate is even rarer... but you would still just have to scrape (search) the page, not crawl.

Crawling every link to find which ones are RSS would turn up very little.

What is your website where you see this behavior?

Update 2:
You want to crawl web pages searching for RSS feeds.
That's a different story.

Any simple PHP or Perl crawler (or even the wget utility) will do it... then simply grep for the RSS feeds.

2006-10-31 01:34:11 · answer #2 · answered by jake cigar™ is retired 7

Firefox mostly uses the LINK tags in the HEAD tag to get the RSS feed. It finds the one with the right content type, I believe. That's covered.

Jake Cigar noted that you can attach a REL attribute to a link element. This describes the relationship of the linked resource, and I use it on my websites (it's good semantically as well). Barely any sites use this though, but you can look in these A tags and find one with the right REL. You can also look at each A tag to check the URL, for example with a regular expression that fetches all the absolute URLs out of the page that are surrounded by double quotes ("). Clearly such an expression will have to be expanded upon to support more types of URLs.
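
A minimal sketch of a pattern along those lines, not necessarily the one the answerer had in mind: it collects absolute http(s) URLs that sit between double quotes. The function name extractQuotedUrls() is made up for illustration, and relative or single-quoted URLs are deliberately ignored.

```php
<?php
/** Return every absolute http(s) URL that appears between double quotes in $html. */
function extractQuotedUrls(string $html): array
{
    // Matches "http://..." or "https://..." up to the closing double quote.
    preg_match_all('~"(https?://[^"\s]+)"~i', $html, $matches);

    // Drop duplicates but keep the order of first appearance.
    return array_values(array_unique($matches[1]));
}
```

Because it looks at any double-quoted string, it also picks up image and stylesheet URLs, so the results still need filtering before they go into a crawl queue.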

To crawl, you will have to get all the URLs on the page. Regular expressions can be used to match links to other resources. Once you've got a list of URLs, download each one with PHP and check its content and/or its content type. Crawling is probably unnecessary though. Most sites publish their RSS as a LINK in the HEAD tag.

2006-11-02 23:37:46 · answer #3 · answered by sk89q 2