PHP: Scraping webpages which use the "pre" tag?


15-12-2008 14:19:04

Hey geniuses. I'm trying to learn webpage scraping and I'd like to be able to scrape the content of pages like this


and store the data in a database. Basically, I want to be able to grab each line with a book listed so that I can parse it for the data on that book. I'm getting stuck with preg_match because of the "pre" HTML tags. Suggestions?

BTW what I've tried so far is to get the whole list of books like this

[code]preg_match( '/TITLE................................\n(.li?)\n</pre>\n<hr></hr>\n<pre>\nPlease/', $ubc_bookstore_text_listings, $all_books );[/code]

but it's not working (I suck at regex).


15-12-2008 14:44:54

Not sure why pre tags would give you a problem; they're just text within the data stream, nothing special. I suspect it's your regex, which looks a bit odd and pretty inflexible the way you've done it. It looks to me like you're depending far too heavily on the exact page layout, hardcoding the number of dots after TITLE for example (and btw the 'dot' character is a wildcard, so if you're looking for literal dots, you must escape them with \). You're also trying to match too much after the desired book list, when all you need to look for is the closing /pre tag -- not the two hr's with newlines and the closing "Please..." paragraph.
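To make the wildcard point concrete, here is a small sketch. The sample strings are made up for illustration and are not taken from the actual page.

```php
<?php
// Hypothetical sample strings, invented for this illustration.
$withDots    = 'TITLE.....AUTHOR';
$withoutDots = 'TITLExxxxxAUTHOR';

// Unescaped dot is a wildcard: it matches any character except a newline,
// so a line with no dots at all still matches.
$loose = preg_match('/^TITLE.{5}AUTHOR$/', $withoutDots);

// Escaped dot matches only a literal "." character.
$strict = preg_match('/^TITLE\.{5}AUTHOR$/', $withoutDots);

// "\.+" accepts any number of literal dots (one or more),
// so the count no longer has to be hardcoded.
$flexible = preg_match('/^TITLE\.+AUTHOR$/', $withDots);

var_dump($loose, $strict, $flexible);
```

preg_match returns 1 on a match and 0 on no match, so the three results show which patterns accept which string.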

I would be looking for TITLE followed by a variable number of dots followed by a newline, something like
(going from memory, don't take that as perfect -- I always have to remind myself of PHP's flavor of regex syntax, there are several types and levels of regex and I can never remember who supports what)
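A rough sketch of that kind of pattern, in case it helps. This is not the exact regex from the post, only one way to write "TITLE, some literal dots, a newline, then everything up to the closing pre tag". The $page string is a made-up stand-in for the real page, and it assumes the dots run to the end of the header line, as in the original attempt.

```php
<?php
// Hypothetical stand-in for the downloaded page; the real layout may differ.
$page = "<pre>\nSome intro text\n</pre>\n<pre>\nTITLE..........\n"
      . "Book One   Author A\nBook Two   Author B\n</pre>\n<hr></hr>\n";

// '#' is used as the delimiter so the "/" in "</pre>" needs no escaping
// (with '/' as delimiter, an unescaped "/" would end the pattern early).
//   \.+     one or more literal dots
//   \r?\n   Unix or Windows line ending
//   (.*?)   lazy capture of the book lines; the "s" modifier lets "." span newlines
$pattern = '#TITLE\.+\r?\n(.*?)\r?\n?</pre>#s';

$books = [];
if (preg_match($pattern, $page, $m)) {
    // Split the captured block into individual lines, dropping empty ones.
    $books = preg_split('/\r?\n/', trim($m[1]), -1, PREG_SPLIT_NO_EMPTY);
}

print_r($books);
```

Each element of $books is then one raw listing line, ready to be parsed into fields.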

But looking at the source for the page you linked to, I see an even easier solution (especially if you're going to hardcode layout info anyway). Parse to the second pre tag. The following line is the column header line; read and parse it or throw it away. Then read into your array every line after that until you get to the /pre closing tag. Bang, you're done. No convoluted regex needed, although I have to admit I'm a big fan of elegant regexes that remove a lot of hardcoded layout assumptions. They're just typically very hard to debug when they go wrong. I used to use a Windows GUI tool that helped visualize regexes as you built them, but I haven't done heavy-duty regex work in a while and forget the name of the tool.
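The line-by-line approach could look roughly like this. It is only a sketch: $page is a made-up stand-in for the real page, and it assumes each pre and /pre tag sits on its own line, which is what the original regex attempt suggests.

```php
<?php
// Hypothetical stand-in for the downloaded page; the real layout may differ.
$page = "<pre>\nIntro\n</pre>\n<pre>\nTITLE..........\n"
      . "Book One   Author A\nBook Two   Author B\n"
      . "</pre>\n<hr></hr>\n<pre>\nPlease note\n</pre>\n";

$lines      = preg_split('/\r?\n/', $page);
$preSeen    = 0;     // number of opening <pre> tags passed so far
$skipHeader = false; // true right after the second <pre>
$books      = [];

foreach ($lines as $line) {
    if ($preSeen < 2) {
        // Still looking for the second opening <pre> tag.
        if (stripos($line, '<pre>') !== false && ++$preSeen === 2) {
            $skipHeader = true;
        }
        continue;
    }
    if (stripos($line, '</pre>') !== false) {
        break; // closing tag reached: end of the book list
    }
    if ($skipHeader) {
        $skipHeader = false; // throw away the column header line
        continue;
    }
    if (trim($line) !== '') {
        $books[] = $line; // one raw book line per array element
    }
}

print_r($books);
```

If the column header is worth keeping, the skipped line can be stored instead of discarded and used to work out where each column starts.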