How do I prevent site scraping?

I have a fairly large website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).

How can I prevent screen scraping? Is it even possible?

I take it you have already set up robots.txt.

As others have mentioned, scrapers can fake nearly every aspect of their activity, and it is probably very difficult to identify the requests coming from the bad guys.

I would consider:

  1. Set up a page, /jail.html.
  2. Disallow access to that page in robots.txt (so the respectful spiders will never visit it).
  3. Place a link to it on one of your pages, hidden with CSS ( display: none ).
  4. Record the IP addresses of visitors to /jail.html.

This can help you quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.

You might even want to make your /jail.html a whole entire website with the same, exact markup as your normal pages, but with fake data ( /jail/album/63ajdka , /jail/track/3aads8 , etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have had the chance to block them entirely.
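A minimal sketch of that trap, assuming a Flask app; the /jail.html route, log file name and fake markup below are illustrative (robots.txt still needs its Disallow line, and the hidden link goes on a normal page):

 # Hypothetical honeypot route: anything requesting it has ignored robots.txt.
 from flask import Flask, request

 app = Flask(__name__)

 @app.route("/jail.html")
 def jail():
     # Log the offender's IP and user agent for later review or blocking.
     with open("scraper_ips.log", "a") as log:
         log.write(f"{request.remote_addr}\t{request.headers.get('User-Agent', '')}\n")
     # Serve markup that looks like a normal artist page, but with fake data.
     return "<html><body><h1>Artist: Not A Real Band</h1></body></html>"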

Sue 'em.

Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You may really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.

Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.

It would be a shame if this would drive you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)

Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.

There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc., and appear as a normal user. The only thing you can do is make the text unavailable at the time the page is loaded – render it as an image, in Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.

If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
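A rough sketch of such per-IP rate limiting, independent of any particular framework; the window size and request limit are made-up numbers:

 import time
 from collections import defaultdict, deque

 WINDOW_SECONDS = 60
 MAX_REQUESTS = 120              # illustrative limit per IP per window

 _hits = defaultdict(deque)      # ip -> timestamps of recent requests

 def allow_request(ip: str) -> bool:
     now = time.time()
     recent = _hits[ip]
     # Drop timestamps that have fallen out of the sliding window.
     while recent and now - recent[0] > WINDOW_SECONDS:
         recent.popleft()
     if len(recent) >= MAX_REQUESTS:
         return False            # caller could answer with HTTP 429 or a CAPTCHA
     recent.append(now)
     return True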

There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.

Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it; you might as well go all out.

This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
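For illustration, a tiny sketch of such an endpoint, assuming Flask; the route, fields and hard-coded artist are placeholders for whatever your real database returns:

 from flask import Flask, Response

 app = Flask(__name__)

 @app.route("/api/artist/<int:artist_id>.xml")
 def artist_xml(artist_id):
     # In a real application this record would come from the artist database.
     xml = f"<artist><id>{artist_id}</id><name>Example Artist</name></artist>"
     return Response(xml, mimetype="application/xml")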

Then all you have to do is convince the people who want your data to use the API. 😉

Sorry, it's really quite hard to do this…

I would suggest that you politely ask them to not use your content (if your content is copyrighted).

If it is and they don't take it down, then you can take further action and send them a cease and desist letter.

Generally, whatever you do to prevent scraping will probably end up having a more negative effect elsewhere, e.g. on accessibility, legitimate bots/spiders, etc.

Okay, as all the posts say, if you want to make it search-engine friendly, then bots can scrape it for sure.

But you can still do a few things, and it may be effective against 60-70% of scraping bots.

Make a checker script like below.

If a particular IP address is visiting very fast, then after a few visits (5-10) put its IP address plus browser information in a file or database.

The next step

Make another script that keeps checking those suspicious IP addresses. (This would be a background process, running all the time or scheduled every few minutes.)

Case 1. If the user agent is that of a known search engine like Google, Bing or Yahoo (you can find more information on user agents by googling it), then check it against http://www.iplists.com/ and try to match patterns. If it seems like a faked user agent, ask the visitor to fill in a CAPTCHA on the next visit. (You need to research bot IP addresses a bit more. I know this is achievable; also try a whois of the IP address. It can be helpful.)

Case 2. The user agent is not that of a search bot: simply ask the visitor to fill in a CAPTCHA on the next visit.
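A rough sketch of the checker flow described above; the thresholds, the in-memory storage and the reverse-DNS test are illustrative assumptions rather than a complete implementation:

 import socket
 import time
 from collections import defaultdict

 visits = defaultdict(list)      # ip -> timestamps of recent visits
 suspicious = {}                 # ip -> user agent string, to be verified later

 def record_visit(ip, user_agent):
     now = time.time()
     visits[ip] = [t for t in visits[ip] if now - t < 10] + [now]
     if len(visits[ip]) > 5:     # more than 5 hits within 10 seconds
         suspicious[ip] = user_agent

 def looks_like_fake_search_bot(ip, user_agent):
     # Genuine Googlebot/Bingbot traffic reverse-resolves to Google/Microsoft hosts.
     if "Googlebot" in user_agent or "bingbot" in user_agent:
         try:
             host = socket.gethostbyaddr(ip)[0]
         except OSError:
             return True
         return not host.endswith((".googlebot.com", ".google.com", ".search.msn.com"))
     return False                # does not claim to be a search bot (Case 2)

 # Background job: for each suspicious IP, serve a CAPTCHA on the next visit if it
 # fakes a search-bot user agent (Case 1) or has no search-bot user agent (Case 2).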

I have done a lot of web scraping and summarized some techniques to stop web scrapers on my blog based on what I find annoying.

It is a tradeoff between your users and scrapers. If you limit IPs, use CAPTCHAs, require login, etc., you make life difficult for the scrapers. But this may also drive away your genuine users.

Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.

Since you're talking about a public site, making the site search-engine friendly will also make it scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.

From a tech perspective: Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.

From a legal perspective: It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.

If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).

I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.

But, assuming you're publishing public-domain information that's not copyrightable, like names and basic stats… you should just let it go in the name of free speech and open data. That is what the web is all about.

Things that might work against beginner scrapers:

  • IP blocking
  • use lots of ajax
  • check referer request header
  • require login

Things that will help in general:

  • change your layout every week
  • robots.txt

Things that will help but will make your users hate you:

  • captcha

Late answer – and also this answer probably isn't the one you want to hear…

I myself have already written many (many tens of) different specialized data-mining scrapers, just because I like the "open data" philosophy.

There is already a lot of advice in the other answers – now I will play the devil's advocate and extend and/or correct their claimed effectiveness.

First:

  • if someone really wants your data
  • you can't effectively (technically) hide your data
  • if the data should be publicly accessible to your "regular users"

then trying to use technical barriers isn't worth the trouble it causes:

  • to your regular users by worsening their user-experience
  • to regular and welcomed bots (search engines)
  • etc…

Plain HTML – the easiest case is parsing plain HTML pages with a well-defined structure and CSS classes. E.g. it is enough to inspect the element with Firebug and use the right XPaths and/or CSS paths in my scraper.

You could generate the HTML structure dynamically, and you could also generate the CSS class names dynamically (and the CSS itself too, e.g. by using random class names) – but

  • you want to present the information to your regular users in a consistent way
  • again, it is enough to analyze the page structure once more to set up the scraper
  • and it can be done automatically by analyzing some "already known content"
    • once someone already knows (from an earlier scrape), e.g.:
    • which page contains the information about "Phil Collins"
    • it is enough to fetch the "Phil Collins" page and (automatically) analyze how the page is structured "today" 🙂

You can't change the structure for every response, because your regular users will hate you. Also, it will cause more trouble for you (maintenance), not for the scraper. The XPath or CSS path is determinable by the scraping script automatically from the known content.
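To illustrate that last point, here is a small sketch of how a scraper can re-derive the path from content it already knows, assuming lxml and a hypothetical saved artist page:

 from lxml import html

 page = html.fromstring(open("artist_page.html").read())
 # Find the element containing a string known from an earlier scrape...
 known = page.xpath('//*[contains(text(), "Phil Collins")]')[0]
 # ...and recover its absolute XPath, which can then be reused for every artist page.
 print(page.getroottree().getpath(known))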

Ajax – a little bit harder at the start, but many times it speeds up the scraping process 🙂 – why?

When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and point my Firefox at it. Of course, because it is my own proxy, it is completely hidden – the target server sees it as a regular browser (so no X-Forwarded-For and similar headers). Based on the proxy logs, it is usually possible to determine the "logic" of the Ajax requests, e.g. I can skip most of the HTML scraping and just use the well-structured Ajax responses (mostly in JSON format).
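Once the endpoint is known from the proxy logs, the scraping step often collapses to a single request; a sketch assuming the requests library, with a completely hypothetical URL and parameter:

 import requests

 # Call the discovered Ajax endpoint directly and use its structured JSON response.
 data = requests.get("https://example.com/ajax/artist", params={"id": 63}).json()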

So, Ajax doesn't help much…

More complicated are pages which use heavily packed JavaScript functions.

Here it is possible to use two basic methods:

  • unpack and understand the JS and create a scraper which follows the JavaScript logic (the hard way)
  • or (my preferred approach) – just use Mozilla with Mozrepl for scraping. E.g. the real scraping is done in a full-featured, JavaScript-enabled browser, which is programmed to click the right elements and just grab the "decoded" responses directly from the browser window.

Such scraping is slow (the scraping is done as in a regular browser), but it is

  • very easy to set up and use
  • and nearly impossible to counter 🙂
  • and the "slowness" is needed anyway to counter the "blocking of rapid requests from the same IP"

User-Agent based filtering doesn't help at all. Any serious data miner will set it to something correct in his scraper.

Require login – doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just logging into the site as a regular user using Mozilla, and afterwards just running the Mozrepl-based scraper…

Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. He just registers himself on your site as a regular user.

Using frames isn't very effective either. This is used by many live movie services and it is not very hard to beat. The frames are simply more HTML/JavaScript pages that need to be analyzed… If the data is worth the trouble, the data miner will do the required analysis.

IP-based limiting isn't effective at all – there are too many public proxy servers, and there is also TOR… 🙂 It doesn't slow down the scraping (for someone who really wants your data).

It is very hard to scrape data hidden in images (e.g. simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times – but honestly, the data must be worth the trouble for the scraper (and many times it isn't).
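A sketch of why images only raise the cost rather than stop a determined scraper, assuming pytesseract and Pillow, with a hypothetical image file:

 from PIL import Image
 import pytesseract

 # OCR the server-rendered image back into plain text.
 text = pytesseract.image_to_string(Image.open("artist_name.png"))
 print(text)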

On the other side, your users will hate you for this. Myself, even when not scraping, I hate websites which don't allow copying the page content to the clipboard (because the information is in images, or (the silly ones) try to bind some custom JavaScript event to the right click 🙂).

The hardest are sites which use Java applets or Flash, where the applet itself makes secure HTTPS requests internally. But think twice – how happy will your iPhone users be… ;). Therefore, currently very few sites use them. Myself, I block all Flash content in my browser (in regular browsing sessions) – and never use sites which depend on Flash.

Your milestones could be…, so you can try this method – just remember – you will probably lose some of your users. Also remember, some SWF files are decompilable. 😉

CAPTCHA (the good ones – like reCAPTCHA) helps a lot – but your users will hate you… – just imagine how your users will love you when they need to solve some CAPTCHAs on every page showing information about the music artists.

Probably no need to continue – you already get the picture.

Now what you should do:

Remember: It is nearly impossible to hide your data if, on the other hand, you want to publish it (in a friendly way) to your regular users.

So,

  • make your data easily accessible – by some API
    • this allows easy data access
    • e.g. it offloads your server from scraping – good for you
  • set up the right usage rights (e.g. they must cite the source)
  • remember, much data isn't copyrightable – and it is hard to protect
  • add some fake data (as you have already done) and use legal tools
    • as others have already said, send a "cease and desist letter"
    • other legal actions (suing and the like) are probably too costly and hard to win (especially against non-US sites)

Think twice before you try to use technical barriers.

Rather than trying to block the data miners, just put more effort into your website's usability. Your users will love you. The time (and energy) invested in technical barriers usually isn't worth it – better to spend the time making an even better website…

Also, data-thieves aren't like normal thieves.

If you buy an inexpensive home alarm and add a warning "this house is connected to the police" – many thieves will not even try to break in. Because one wrong move – and they go to jail…

So, you invest only a few bucks, but the thief invests and risks much.

But the data thief has no such risks. Just the opposite – if you make one wrong move (e.g. if you introduce some bug as a result of technical barriers), you will lose your users. If the scraping bot does not work the first time, nothing happens – the data miner will just try another approach and/or debug the script.

In this case, you need to invest much more – and the scraper invests much less.

Just think about where you want to invest your time and energy…

PS: English isn't my native language – so forgive my broken English…

Sure it's possible. For 100% success, take your site offline.

In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).

You can do things like require several seconds between the first connection to your site and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.

I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.

There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.

However, I assume that if you don't want it scraped that means you don't want search engines to index it either.

Here are some things you can try:

  • Show the text in an image. This is quite reliable, and is less of a pain on the user than a CAPTCHA, but means they won't be able to cut and paste and it won't scale prettily or be accessible.
  • Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
  • Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally – a screen-scraper might set up an account and might cleverly program their script to log in for them.
  • If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
  • You can set up a black list of known screen scraper user-agent strings as you discover them. Again, this will only help the lazily-coded ones; a programmer who knows what he's doing can set a user-agent string to impersonate a web browser. (A small sketch of these two user-agent checks follows this list.)
  • Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
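A small sketch of the empty-user-agent and blacklist checks above; framework-agnostic, and the blacklist entries are just illustrative examples:

 # Substrings seen in lazily-coded scrapers' user agents (illustrative, not exhaustive).
 SCRAPER_UA_BLACKLIST = ("python-requests", "curl", "scrapy", "wget")

 def should_block(user_agent: str) -> bool:
     if not user_agent:          # lazily programmed scripts often send no user agent
         return True
     ua = user_agent.lower()
     return any(marker in ua for marker in SCRAPER_UA_BLACKLIST)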

If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them I guess.

  1. No, it's not possible to stop (in any way)
  2. Embrace it. Why not publish as RDFa and become super search engine friendly and encourage the re-use of data? People will thank you and provide credit where due (see musicbrainz as an example).

It is not the answer you probably want, but why hide what you're trying to make public?

Method One (Small Sites Only):
Serve encrypted / encoded data.
I scrape the web using Python (urllib, requests, BeautifulSoup, etc.) and have found many websites that serve encrypted / encoded data that is not decryptable in any programming language, simply because the encryption method does not exist.

I achieved this on a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the response was always jumbled content.

Example of minimizing output in PHP ( How to minify php page html output? ):

 <?php
 function sanitize_output($buffer) {
     $search = array(
         '/\>[^\S ]+/s',  // strip whitespaces after tags, except space
         '/[^\S ]+\</s',  // strip whitespaces before tags, except space
         '/(\s)+/s'       // shorten multiple whitespace sequences
     );
     $replace = array('>', '<', '\\1');
     $buffer = preg_replace($search, $replace, $buffer);
     return $buffer;
 }
 ob_start("sanitize_output");
 ?>

Method Two:
If you can't stop them, screw them over: serve fake / useless data as a response.

Method Three:
Block common scraping user agents; you'll see this on major / large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.

Method Four:
Make sure all the user headers are valid. I sometimes provide as many headers as possible to make my scraper seem like an authentic user; some of them are not even true or valid, like en-FU :).
Here is a list of some of the headers I commonly provide.

 headers = {
     "Requested-URI": "/example",
     "Request-Method": "GET",
     "Remote-IP-Address": "656.787.909.121",
     "Remote-IP-Port": "69696",
     "Protocol-version": "HTTP/1.1",
     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
     "Accept-Encoding": "gzip,deflate",
     "Accept-Language": "en-FU,en;q=0.8",
     "Cache-Control": "max-age=0",
     "Connection": "keep-alive",
     "Dnt": "1",
     "Host": "http://example.com",
     "Referer": "http://example.com",
     "Upgrade-Insecure-Requests": "1",
     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
 }

Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.

Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.

Quick approach to this would be to set a booby/bot trap.

  1. Make a page that, if it's opened a certain number of times or even opened at all, will collect certain information like the IP and whatnot (you can also consider irregularities or patterns, but this page shouldn't have to be opened at all).

  2. Make a link to this in your page that is hidden with CSS display:none; or left:-9999px; position:absolute; and try to place it somewhere it is less likely to be ignored, like under your content and not in your footer, as bots can sometimes choose to skip certain parts of a page.

  3. In your robots.txt file, set a whole bunch of disallow rules for pages you don't want friendly bots (LOL, like they have happy faces!) to gather information on, and set this page as one of them.

  4. Now, if a friendly bot comes through, it should ignore that page. Right, but that still isn't good enough. Make a couple more of these pages, or somehow re-route a page to accept different names, and then place more disallow rules for these trap pages in your robots.txt file alongside the pages you want ignored.

  5. Collect the IPs of these bots or of anyone that enters these pages. Don't ban them, but make a function to display noodled text in your content, like random numbers, copyright notices, specific text strings or scary pictures – basically anything to hinder your good content. You can also set links that point to a page which takes forever to load, e.g. in PHP you can use the sleep() function (a sketch of this trap follows the list). This will fight the crawler back if it has some sort of detection to skip pages that take too long to load, as some well-written bots are set to process X number of links at a time.

  6. If you have made specific text strings/sentences, why not go to your favorite search engine and search for them? It might show you where your content is ending up.
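The list mentions PHP's sleep(); as a sketch of the same slow-page trap in Python (Flask assumed; the route and delay are made up):

 import time
 from flask import Flask

 app = Flask(__name__)

 @app.route("/trap/slow")
 def slow_trap():
     time.sleep(60)              # ties up greedy crawlers that follow every link
     return "Loading..."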

Anyway, if you think tactically and creatively this could be a good starting point. The best thing to do would be to learn how a bot works.

I'd also think about scrambling some IDs or the way attributes on the page elements are displayed:

 <a class="someclass" href="../xyz/abc" rel="nofollow" title="sometitle"> 

that changes its form every time, as some bots might be set to look for specific patterns in your pages or targeted elements.

 <a title="sometitle" href="../xyz/abc" rel="nofollow" class="someclass" id="p-12802">
 <a title="sometitle" href="../xyz/abc" rel="nofollow" class="someclass" id="p-00392">

You can't stop normal screen scraping. For better or worse, it's the nature of the web.

You can make it so no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache . I assume it wouldn't be too difficult to do in IIS as well.

One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:

  • Skechers : XML

     <document filename="" height="" width="" title="SKECHERS" linkType="" linkUrl="" imageMap="" href=&quot;http://www.bobsfromskechers.com&quot; alt=&quot;BOBS from Skechers&quot; title=&quot;BOBS from Skechers&quot; /> 
  • Chrome Web Store : JSON

     <script type="text/javascript" src="https://apis.google.com/js/plusone.js">{"lang": "en", "parsetags": "explicit"}</script> 
  • Bing News : data URL

     <script type="text/javascript"> //<![CDATA[ (function() { var x;x=_ge('emb7'); if(x) { x.src='data:image/jpeg;base64,/*...*/'; } }() ) 
  • Protopage : URL Encoded Strings

     unescape('Rolling%20Stone%20%3a%20Rock%20and%20Roll%20Daily') 
  • TiddlyWiki : HTML Entities + preformatted JSON

      <pre> {&quot;tiddlers&quot;: { &quot;GettingStarted&quot;: { &quot;title&quot;: &quot;GettingStarted&quot;, &quot;text&quot;: &quot;Welcome to TiddlyWiki, } } } </pre> 
  • Amazon : Lazy Loading

     amzn.copilot.jQuery=i;amzn.copilot.jQuery(document).ready(function(){d(b);f(c,function() {amzn.copilot.setup({serviceEndPoint:h.vipUrl,isContinuedSession:true})})})},f=function(i,h){var j=document.createElement("script");j.type="text/javascript";j.src=i;j.async=true;j.onload=h;a.appendChild(j)},d=function(h){var i=document.createElement("link");i.type="text/css";i.rel="stylesheet";i.href=h;a.appendChild(i)}})(); amzn.copilot.checkCoPilotSession({jsUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-js/cs-copilot-customer-js-min-1875890922._V1_.js', cssUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-css/cs-copilot-customer-css-min-2367001420._V1_.css', vipUrl : 'https://copilot.amazon.com' 
  • XMLCalabash : Namespaced XML + Custom MIME type + Custom File extension

      <p:declare-step type="pxp:zip"> <p:input port="source" sequence="true" primary="true"/> <p:input port="manifest"/> <p:output port="result"/> <p:option name="href" required="true" cx:type="xsd:anyURI"/> <p:option name="compression-method" cx:type="stored|deflated"/> <p:option name="compression-level" cx:type="smallest|fastest|default|huffman|none"/> <p:option name="command" select="'update'" cx:type="update|freshen|create|delete"/> </p:declare-step> 

If you view source on any of the above, you see that scraping will simply return metadata and navigation.

I agree with most of the posts above, and I'd like to add that the more search-engine friendly your site is, the more scrape-able it will be. You could try to do a couple of things that are very out there and make it harder for scrapers, but they might also affect your search-ability… It depends on how well you want your site to rank on search engines, of course.

Putting your content behind a captcha would mean that robots would find it difficult to access your content. However, humans would be inconvenienced so that may be undesirable.

If you want to see a great example, check out http://www.bkstr.com/ . They use a JavaScript algorithm to set a cookie, then reload the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get by this, but it would stop most cURL-type scraping.

Most of it has already been said, but have you considered the CloudFlare protection? I mean this:

[image: CloudFlare protection feature]

Other companies probably do this too; CloudFlare is the only one I know of.

I'm pretty sure that would complicate their work. I also once got my IP banned automatically for 4 months when I tried to scrape data from a site protected by CloudFlare due to rate limiting (I used a simple AJAX request loop).

Screen scrapers work by processing HTML, and if they are determined to get your data, there is not much you can do technically, because the human eyeball processes anything. Legally, it has already been pointed out that you may have some recourse, and that would be my recommendation.

However, you can hide the critical part of your data by using non-HTML-based presentation logic

  • Generate a Flash file for each artist/album, etc.
  • Generate an image for each artist's content. Maybe just an image of the artist name, etc. would be enough. Do this by rendering the text onto a JPEG / PNG file on the server and linking to that image.

Bear in mind that this would probably affect your search rankings.
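A minimal sketch of the render-text-to-image option above, assuming Pillow; the font, size and file name are arbitrary choices:

 from PIL import Image, ImageDraw, ImageFont

 def render_artist_name(name: str, path: str) -> None:
     # Draw the artist name onto a small white PNG served instead of plain text.
     img = Image.new("RGB", (400, 40), "white")
     ImageDraw.Draw(img).text((5, 5), name, fill="black", font=ImageFont.load_default())
     img.save(path)

 render_artist_name("Example Artist", "example-artist.png")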

Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You can no longer use a cache or static content then.