"A look at search engines with their own indexes"

- by Rohan Kumar (Seirdy)


On the back of the discussion about SearXNG, I found this article really interesting! I particularly like his evaluation of building new search engines:

When building webpages, authors need to consider the barriers to entry for a new search engine. The best engines we can build today shouldn’t replace Google. They should try to be different. We want to see the Web that Google won’t show us, and search engine diversity is an important step in that direction (§ Findings).

2 Likes

The thing about other search engines is most of them have no spam filtering. Google search is good at protecting users from spam and phishing sites as it detects malicious URLs and removes them to prevent users from clicking on them. A user on /r/privacyguides posted that DuckDuckGo doesn’t have the same level of filtering: when he searched for GitHub, it showed a fake, malicious link.

Take Wiby for example. Wiby is one of the worst search engines to use because it is powered by user-submitted sites, meaning anyone can submit a fake URL and there is nothing stopping anyone from clicking on a malicious URL. This should really be considered with search engines.

I think Privacy Guides should warn people about the dangers of using other search engines that don’t use Google’s search index. I’m not saying you shouldn’t use DuckDuckGo and I understand if you’re concerned about censorship, but for security-sensitive searches, you should only use Google or a privacy search engine that uses Google’s index, such as Brave search or Startpage.

3 Likes

The thing about other search engines is most of them have no spam filtering.

This isn’t true, but perhaps you mean that you think their spam filtering is worse? That’s fair enough.

This should really be considered with search engines.

I definitely agree. I think that’s partly why I found Seirdy’s article so interesting: because it’s intended to be a “living document” and gives a very broad overview of different search engines. Hopefully, it could be a helpful resource for making those kinds of evaluations. :+1:

My favourite finds were Semantic Scholar, Teclis, and Kagi Search. I’m particularly keen to see how Kagi progresses!

Google search is good at protecting users from spam and phishing sites as it detects malicious URLs and removes them to prevent users from clicking on them.

This is a super old post, but I’d like to challenge this. Here’s an example of Google returning a malicious link as the first result when users search for Keepass, instead of the actual website.

This behavior is enabled by Google’s bizarre decision to not display punycodes in URLs, but worsened by Google’s inability to curtail malvertising. I’ve seen multiple articles from Ars about Google malvertising. They can’t seem to get a handle on this.

Compare this to another search engine, like Mojeek, which doesn’t index any URLs with punycodes in them at all:

Our approach is not to index them. We’ve discussed this a few times before but for now we’re focussing on the large quantity of non-punycode URLs.

That’s not to say that we’d never look at this area, but this is the first time a question has been asked about it, so it’s not a priority at this time.

I don’t have high confidence in Google’s results not to be spam or malware sites after knowing this.

Kagi is working on this: Filter out or mark punycode domains - Kagi Feedback

Also: Duckduckgo uses Bing’s index. The same criticism applies to all Bing-based search engines. Search Engine Map is a pretty good way of identifying these search engines.

1 Like

I see this argument made sometimes, not just for search engine results but also the more obvious one: browser URL bars. Now I’m not saying that the example here is not worrying, but nobody seems to consider people who actually use webpages that use punycode.

If I normally log in to my bank/company at 銀行.com, then if I somehow landed on 鉝蠩.com instead, that would obviously be the wrong page. But if my browser only displays punycode, then xn--jn2ax2s.com looks awfully similar to xn--jm2ax2s.com, and in general these punycoded versions are hard to tell apart from each other because they look a lot like random strings, not like something to remember or pronounce.

Therefore I don’t like the advice of “always display punycode, never the actual characters that are encoded”. Maybe we need something that allows the user to select what alphabets/writing systems they are capable of reading, and understand that this might be an important step in securing their computing. But we shouldn’t assume that Latin characters are everything everyone ever needs. That assumption stops working even if we don’t venture into Asia but just into Europe. xn--baw-joa.social is a relatively popular German Mastodon instance. Now for a German, if they landed on bamü.social instead of bawü.social, there’s a fighting chance they would notice the difference in the URL and, for example, not put in their account credentials there. But what if xn--bam-joa.social and xn--baw-joa.social is all they see? Chances are, even though it’s also just a character change from w to m and nothing else, they wouldn’t bother really looking at that long string with xn in front, as it doesn’t parse as having much meaning to a mere mortal anyway.
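To make the bawü/bamü comparison concrete, here is a minimal Python sketch of the round trip between the readable and the punycode forms. It assumes Python's built-in `idna` codec, which implements the older IDNA 2003 rules; that is enough for these simple labels, but browsers and registries may apply newer rules. The helper names `to_ascii`, `to_unicode`, and `differing_positions` are made up for this illustration.

```python
# Illustration only: uses Python's built-in "idna" codec (IDNA 2003 rules).
# Helper names are hypothetical and not taken from any browser or search engine.


def to_ascii(hostname: str) -> str:
    """Return the ASCII ("xn--") form of a hostname."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Return the readable Unicode form of an ASCII hostname."""
    return hostname.encode("ascii").decode("idna")


def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


real = "bawü.social"
fake = "bamü.social"

real_ascii = to_ascii(real)
fake_ascii = to_ascii(fake)

print(real, "->", real_ascii)
print(fake, "->", fake_ascii)
print("characters that differ:", differing_positions(real_ascii, fake_ascii))
print("decoded again:", to_unicode(real_ascii))
```

Both forms differ in exactly one character, but in the punycode form that one character sits inside a string that does not read as a word, which is the point made above.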

And as for what Mojeek is doing, yes this is foregoing the issue I guess. But even though there is certainly a “large quantity of non-punycode URLs”, just ignoring all other pages is not really a solution. There are also many completely valid punycode URLs with interesting content. I can understand the sentiment of somehow wanting to pull back the wheel of time and pretend punycode URLs are not a thing (and yeah, in hindsight that might be for the better), but here we are. At least when it comes to the ICANN root-based DNS, punycode is here to stay.

Maybe search engines are just not the holy grail that saves us all from malicious websites. I still think Google with their size could do a much better job at this stuff, and it’s especially bad that they’re kind of enabling it in the first place with how they sell advertising space to literal scammers. I think many of us need to grok at some point that web search is a service that you probably want to pay for, the same as you pay for privacy-respecting email, cloud storage etc. and not just falls from the sky.

3 Likes

Now I’m not saying that the example here is not worrying, but nobody seems to consider people who actually use webpages that use punycode.

As someone whose browsing activity does involve visiting some Japanese sites, I was not aware that Japanese sites actually used punycodes in their URLs. I can’t immediately think of any. I assumed this practice was similar for other languages, because no one ever brings them up in these discussions in my experience.

Mojeek doesn’t display them at all because it only indexes languages that use Latin characters, but they might re-evaluate that when they start indexing, say, Japanese webpages.

Firefox displays both the encoded characters and the punycode when you go into about:config and force the option on:

I think this UI is bad because you can easily miss the URL at the bottom (especially if you have autocomplete on), but it demonstrates that you can display both. For example, if a search engine detects that the domain uses a punycode, highlight it in some way to the user and display both of them. Educating the average user about punycodes in five seconds is a tall ask, but trying something is better than doing nothing at all. Unwitting Keepass users might have had a chance if the UI was better.
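As a rough idea of what "detect it and display both" could look like on the search engine side, here is a small Python sketch. It is not how Kagi, Mojeek, or any browser actually does it. `annotate_host`, `HostInfo`, and `rough_script` are hypothetical names, the built-in `idna` codec only covers the older IDNA 2003 rules, and guessing a script from the first word of the Unicode character name is a crude stand-in for a real script property lookup.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class HostInfo:
    ascii_form: str    # the "xn--" form that is actually resolved
    unicode_form: str  # the form a reader of that script expects to see
    is_idn: bool       # True if any label is punycode-encoded
    unfamiliar: list   # rough script names outside the user's chosen set


def rough_script(ch: str) -> str:
    """Crude script guess: the first word of the Unicode character name."""
    if ch.isascii():
        return "LATIN" if ch.isalpha() else "COMMON"  # digits, "-", "."
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def annotate_host(hostname: str, readable=frozenset({"LATIN"})) -> HostInfo:
    """Collect both spellings of a hostname plus a simple warning signal.

    Raises UnicodeError if the hostname cannot be encoded at all
    (for example, an empty or overlong label).
    """
    ascii_form = hostname.encode("idna").decode("ascii").lower()
    try:
        unicode_form = ascii_form.encode("ascii").decode("idna")
    except UnicodeError:
        unicode_form = ascii_form  # malformed punycode: show the raw form
    is_idn = any(label.startswith("xn--") for label in ascii_form.split("."))
    scripts = {rough_script(ch) for ch in unicode_form}
    unfamiliar = sorted(scripts - set(readable) - {"COMMON"})
    return HostInfo(ascii_form, unicode_form, is_idn, unfamiliar)


for host in ("example.com", "xn--baw-joa.social", "銀行.com"):
    info = annotate_host(host)
    note = " [internationalized domain]" if info.is_idn else ""
    warn = f" [scripts you did not select: {info.unfamiliar}]" if info.unfamiliar else ""
    print(f"{info.unicode_form} ({info.ascii_form}){note}{warn}")
```

The `readable` parameter follows the suggestion earlier in the thread that users could pick the writing systems they can read: a German user would keep the default, while someone who reads Japanese could pass `{"LATIN", "CJK", "HIRAGANA", "KATAKANA"}` and only be warned about everything else.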

I think you’re right about not ignoring this altogether if there are actually legitimate websites out there using punycodes. I just wasn’t aware it was something used by anyone other than scammers (which is my bad for my ignorance, but where does one even find these examples?). The Mastodon instance you mentioned demonstrates that punycodes are in legitimate use.

I still think Google with their size could do a much better job at this stuff, and it’s especially bad that they’re kind of enabling it in the first place with how they sell advertising space to literal scammers.

I have seen far too many reports of malvertising on Google and personally witnessed malvertising on YouTube to trust Google with my security, which is why…

Maybe search engines are just not the holy grail that saves us all from malicious websites.

I think you’re completely right.

I think many of us need to grok at some point that web search is a service that you probably want to pay for, the same as you pay for privacy-respecting email, cloud storage etc. and not just falls from the sky.

Brave Search is about the only independent option you can pay for at the moment. According to the Kagi forum, they display punycodes. I pay for Kagi, but it’s only partially independent (mostly dependent on Google).

Mojeek aligns very strongly with many of my values, so I’d love to pay for a Mojeek that serves me Japanese results one day.

Unfortunately, Kagi was disqualified from PG precisely because it is paid, so I’m not sure a paid search engine will ever be recommended under the current guidelines.

It’s very doubtful, yeah. We can’t add a million things to the site, because then we have to keep track of all of them, and there really aren’t enough people working on reviewing things for PG to do that frankly.

So one of our main goals is to promote organizations who are making privacy tools as accessible as possible, and that means not including people who are paywalling privacy while plenty of their competitors offer the same thing for free. :slight_smile:

Not that I dislike Kagi still, but paying for ad-free search is merely a convenience, not a privacy benefit.

1 Like

To add to this, in my days of frequenting RuneScape, it always was, and still is, a big issue that Google delivers phishing websites above legit ones.

1 Like

I don’t disagree with any of that. It would be nice to have good, privacy-respecting services for free.

I don’t pay for Kagi to avoid seeing ads; I pay for better results. It’s a better search engine than every other service. It just so happens that ads make results worse. Plus it identifies paywalled articles in the results, searches the Internet Archive, surfaces interesting pages from smaller websites thanks to the Teclis index, lets me re-rank/block sites, lets me rewrite sites with regex, and works entirely without JavaScript.

And none of that is directly privacy related, of course, but I just wanted to clarify why I think Kagi earns that price tag.

2 Likes

It’s of course an important aspect to consider how much punycode is actually used. After all, if really nobody uses it, then the argument would be much easier to make that you should just not render it, and maybe even block users from visiting websites that start with “xn--” altogether. After all, if nobody seriously uses it, it should be made obvious for everyone including the average user who has no idea what all of this even means and who doesn’t know the most basic things about the DNS, that this domain is out of the ordinary and most certainly not what they wanted to visit.

We seem to be in some kind of weird middle ground with regard to IDNs, where they are supported somehow, but also don’t really fully work as intended (as in, for example, not being listed in search engines, not being correctly displayed in some UIs, etc.) and I don’t like it. The internet community should decide on whether they want them to be a first-class part of internet technologies, or if they’re gonna be thrown out altogether. Especially when it comes to aspects such as security, having these half-baked solutions lingering about without people giving them much attention doesn’t seem wise. It’s the same as sysadmins who just try to ignore IPv6 despite probably every end-user terminal and network device on their network supporting it, and most of them silently having it enabled. Whether they like it or not, in order to make sure their network is secure, they have to actively understand and manage IPv6; ignorance is not an option.

1 Like