|
Kiril L.
May 8, 2007
11:06 PM
|
Disable search-bots (e.g. Google) |
| Hello, I was wondering whether it is enough to add a file named robots.txt to my WWW folder (with this inside it: User-agent: * Disallow: / ) to prevent Google and other search bots from indexing my site. Or does this require a www.brandeis.edu site admin to edit the robots.txt file in the root of brandeis.edu? >According to Google, robots.txt needs to be placed in the root of the site. >My current location (perhaps useless) is http://people.brandeis.edu/~wrussian/robots.txt >If it does require one of the admins to add my ID to the robots.txt file on brandeis.edu, could you please do so? Thank you. |
Elliot Kendall
May 9, 2007
08:32:10 AM
|
If you're perversely interested, the standard for robots.txt files is available here. The short answer, however, is that the file must be in the web server's root directory to be valid.
If you like, I can add your site to the global robots.txt file on people.brandeis.edu. This may not achieve the effect you're looking for, though. While GoogleBot and other well-behaved robots will read the file and respect its contents, there are many, many nefarious bots operating on the web today. These bots will not only ignore the contents of the file, they will specifically use it to index directories that aren't linked from anywhere.
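As a rough illustration of the two points above, here is a minimal Python sketch using only the standard library. It shows where a crawler would look for robots.txt for a given page (always the server root, not the user's folder) and how a well-behaved bot would interpret an entry in the global file. The rule text, the hw5.css path, and the `~someone` user are assumptions made up for the example, not the actual contents of the file on people.brandeis.edu.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url: str) -> str:
    """Return the only robots.txt location a crawler consults for this page."""
    parts = urlsplit(page_url)
    # Crawlers discard the page's own path and always ask for /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical entry in the global robots.txt (assumed for illustration).
GLOBAL_ROBOTS = """\
User-agent: *
Disallow: /~wrussian/
"""

parser = RobotFileParser()
parser.parse(GLOBAL_ROBOTS.splitlines())

homework = "http://people.brandeis.edu/~wrussian/hw5.css"
other = "http://people.brandeis.edu/~someone/index.html"  # hypothetical user

print(robots_url_for(homework))               # where the file must live
print(parser.can_fetch("Googlebot", homework))  # blocked for polite bots
print(parser.can_fetch("Googlebot", other))     # other folders unaffected
```

Note that `can_fetch` only models what a cooperative crawler chooses to do; nothing in robots.txt prevents a bot from requesting the file anyway.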
Furthermore, Google and company won't necessarily stop listing your site because robots.txt has changed. They may simply stop crawling it looking for changes, and you'd have to wait months for it to expire from their respective caches.
In my opinion, the best way to keep a low profile is to not publish links to the site anywhere else on the web. If the search bots don't know about your site, they can't index it. To reverse the damage already done, each search engine has its own process for getting pages removed from its index. You'll probably have to do each one individually. Unfortunately, getting content out of search engines is simply not an easy process.
|
Kiril L.
May 9, 2007
01:22:22 PM
|
Well, let's just leave it the way it is. I have never linked to my Brandeis webspace from anywhere else. I would put my name on a homework, and now if I Google my name, this homework comes up. As I understand it, this is the role of a bot: to crawl the web and search for data. Thanks for the interesting reading. I will look at it in depth after the finals are over :-)
|
Steven Karel
Administrator
May 9, 2007
01:29:47 PM
|
Are there professors who tell you to put homework solutions in your web folder? That seems like a bad place for them unless they are meant to be totally public. Hiding them from robots with robots.txt will not change the fact that they are readable by anyone with a UNet account.
|
Kiril L.
May 9, 2007
01:35:14 PM
|
For my CS65a (3D Animation class) we upload the homeworks. I mean, this is not meant to be 'totally' private, as it is useful to review other students' projects in this class, but I do not want the whole world to be able to see my name associated with these projects.
|
David Wisniewski
Administrator
May 9, 2007
01:37:02 PM
|
Not that this helps you now, but... That's one of the (many) reasons Brandeis offers a course management system: collaboration amongst students in a class is a good thing, but making developmental and formative assignments available for the world to see for all of time is generally a bad thing.
|
Elliot Kendall
May 9, 2007
01:37:37 PM
|
I agree with Steven. Private content shouldn't be on the web at all, except behind a password. Once a search engine gets ahold of it, it's very hard to get the genie back into the bottle.
I couldn't figure out for sure where Google got your page from, but here's an interesting page. If this page came before Google indexed your site, then probably Google got it from there. This is also potentially a good example of the kind of malicious robots you have to worry about. The people who run medical-papers.com seem to be interested in ripping off people's content for advertising money. If they run their own spider, there's no reason they would want it to respect robots.txt.
|
Kiril L.
May 9, 2007
01:40:38 PM
|
Right, that was my concern in the first place. I do not mind sharing useful info with people. However, even this post would probably be available on Google within a month. All my other tech support posts are Googleable. This is why I thought it would be good to have a robots file. But it is difficult for me to add this file, since Elliot has to do it, and it is also difficult for me to remove already existing index entries, since I would again need access to the root.
|
Elliot Kendall
May 9, 2007
01:42:41 PM
|
Actually, I made a decent suggestion in my last post. Why not password-protect your homework files? You'd have to send a username and password to your professors/TAs to access it, but that shouldn't be unreasonable. It should also be reasonably effective to do that now and ask Google to remove your pages from their cache.
|
Kiril L.
May 9, 2007
01:48:53 PM
|
Yes indeed, that seems to be the best idea. And that is a very interesting discovery you made about medical-papers.com, which does not even exist (except in a cache). I don't have the slightest clue why my .css would be there. I guess their bots searched all .edu domains for hw5.css. What would you recommend as the best way to password-protect the files, or even the whole folder? Should I create a rar of my files?
|
David Wisniewski
Administrator
May 9, 2007
02:06:32 PM
|
htaccess will work very nicely.
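For anyone wondering what that looks like in practice, below is a minimal Python sketch that writes out the text of a basic password-protecting .htaccess file. The four directives are standard Apache ones for HTTP Basic authentication; the realm name and the password file path are placeholders invented for this example, and the sketch assumes the server is configured to honor authentication directives in .htaccess files.

```python
def render_htaccess(realm: str, password_file: str) -> str:
    """Return the text of a minimal .htaccess enabling HTTP Basic auth."""
    lines = [
        "AuthType Basic",                  # use HTTP Basic authentication
        f'AuthName "{realm}"',             # label shown in the login prompt
        f"AuthUserFile {password_file}",   # file of usernames and hashed passwords
        "Require valid-user",              # any user listed in that file may enter
    ]
    return "\n".join(lines) + "\n"


# Both arguments are hypothetical placeholders, not real Brandeis paths.
HTACCESS_TEXT = render_htaccess("CS65a homework", "/path/to/.htpasswd")
print(HTACCESS_TEXT)
```

The password file itself is normally created with Apache's `htpasswd` utility and is best kept outside the web-visible folder. The printed text would be saved as `.htaccess` in the folder to be protected.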
|
Kiril L.
May 9, 2007
02:11:12 PM
|
great! I will try it before the end of the week. Thanks everyone.
|
Kiril L.
May 10, 2007
02:30:51 AM
|
I tried it, and it works great! I am going to look into the webmaster tools at Google to see how I could remove my pages from their index. Quick question: now that most of my webspace is protected by the .htaccess file, would Google still consider my site as a source for indexing, even though they cannot read it? (Meaning, if they have cached a file/folder and now they cannot read it, would it be considered a 401 and be removed in like 6 months?) Again, thank you everyone for such useful responses.
|
Kiril L.
May 10, 2007
02:35:53 AM
|
PS: by 401 I meant 404
|
Jonathan Zornow
May 10, 2007
07:04:25 AM
|
Out of curiosity, what are everyone's thoughts about posting assignments and papers online, for all the world to see?
|
Kiril L.
May 10, 2007
08:08:56 AM
|
Hmm, I would say only do that if the above is your intention and you do not fear people using your work without giving you credit. Also, you might have personal info in essays that you do not want revealed to anyone who googles your name or user ID. Finally, future employers might find something they would pass judgment upon; you just never know. That's my take. Use good judgment on what needs to be private and what is a good resource to share.
|
Matthew Galinko
May 10, 2007
08:55:31 AM
|
Would it also work to use the .htaccess file to grant access to a directory only if the IP falls within Brandeis's range? As long as you don't mind it being accessible to anybody with a UNETID or some sort of "proxy" access to a machine on campus, you could avoid having to give out passwords.
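To make the idea concrete, here is a small Python sketch of the check such a rule performs: is the visitor's address inside a given network range? The CIDR block below is a reserved documentation range used purely as a placeholder, not Brandeis's actual address space, and `is_on_campus` is a name made up for this example.

```python
import ipaddress

# Placeholder range (reserved for documentation); substitute the real campus block.
CAMPUS_NETWORK = ipaddress.ip_network("192.0.2.0/24")


def is_on_campus(client_ip: str, network=CAMPUS_NETWORK) -> bool:
    """Return True if client_ip falls inside the allowed network range."""
    return ipaddress.ip_address(client_ip) in network


print(is_on_campus("192.0.2.17"))    # inside the placeholder range
print(is_on_campus("198.51.100.5"))  # outside it
```

In an actual .htaccess file the web server does this comparison itself; Apache expresses it with access-control directives such as `Allow from` (older releases) or `Require ip` (2.4 and later). As noted, anyone who can route traffic through a machine inside the range passes the check.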
|
Elliot Kendall
May 10, 2007
09:03:40 AM
|
Kiril: I'm not sure exactly how Google treats 401 vs 404 for indexing purposes. Once you get the pages out of the cache, though, you should be in good shape.
Jonathan: I might be worried about the possibility of someone stealing your work, submitting it as their own and potentially getting you in trouble as being complicit in cheating. That shouldn't happen if the authorities involved have a clue, but that's not always the case.
Depending on the quality of the work, it might actually improve your chances with employers. They're not just looking for bad stuff on the web to disqualify applicants, but also good stuff to make people stand out more.
Matthew: As Kiril points out, you can do that with htaccess files. In addition to the links he posted, you might also want to check out the full documentation courtesy of Apache.
|
Steven Karel
Administrator
May 10, 2007
09:05:30 AM
|
Matthew wrote:
Would it also work to use the .htaccess file to grant access to a directory only if the IP falls within Brandeis's range?
That should work also. I have not tried it on people.brandeis.edu though I use it in some directories on www.bio.brandeis.edu
Jonathan wrote:
What are everyone's thoughts about posting assignments and papers online
I'm also interested in what undergrads have to say about that. It can certainly be a fun and useful thing to do with class projects; see the Field Biology class website.
|