Have you ever searched for a common item on Amazon and found listings with strange names, descriptions, etc. but great reviews?
I sure have.
And in the rare instances where I've bought those things, or friends of mine have, they've basically been a lottery ticket in terms of what you actually get in the mail.
It seems Amazon has an item reuse problem.
What's the problem?
Based on my experience, my friends' experiences, and even a couple of YouTubers' accounts of theirs, it seems Amazon has a scourge of merchants who somehow trick the system into promoting their dubious-origin products.
They do this in a few ways. The most common is to buy reviews, either off the website (buying spam reviews) or through the platform (paying or bribing customers to leave great reviews). However, these seem to be caught pretty quickly by Amazon.
The next common way is to take products with lots of natural reviews (or with unnatural reviews that have passed scrutiny) and replace them wholesale-but-inline with another product.
Let's say you bought an SD card off Amazon that says it has 1TB of storage space. You get it home, plug it in, and your computer says it has 1TB of space. Great!
But after a couple days, you notice that the files or photos you put on it first are missing or corrupt, and you can only access the more recent files. Turns out, you have a 32GB SD card that's been firmware hacked to report more, so once you write more than 32GB you start losing data off the other side.
You go to Amazon to write a nasty review and notice there are thousands of positive, glowing reviews for this item. Scrolling down, you notice none of the positive ones say anything about an SD card. In fact, most of them talk about socks for their grandchildren.
But you bought a legit-looking SD card! It had thousands of reviews, a boring description you didn't read, and a headline "1TB SD card for Digital Camera, Nintendo Switch, Samsung Galaxy, Valve Steamdeck, ..." You know, everything that takes an SD card.
Turns out, you got scammed
That item used to be socks for babies, or tupperware, or a garden hose, or literally anything else. It got great reviews, so the merchant (or a hacker of the merchant) decided it was high time to use that good will against the rest of you.
They then modified the item, replacing the title, description, photos, metadata, etc. so that it all describes a 1TB SD card.
Now it looks like a 1TB SD card with thousands of glowing reviews.
Okay, so what's the solution?
Simple, really: hash the product when it's first launched, and record that hash with every review. Once the hash diverges, hide the reviews for the old version, effectively resetting them.
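Here's a minimal Python sketch of that naive version. The names (Product, Review, listing_hash) and the choice of SHA-256 are my own assumptions for illustration, not anything Amazon actually has. Note that an exact hash like this resets the reviews on any edit at all, even a fixed typo.

```python
import hashlib
from dataclasses import dataclass, field


def listing_hash(title: str, description: str) -> str:
    """Exact hash of the merchant-editable text (hypothetical fields)."""
    payload = "\x1f".join([title, description]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Review:
    text: str
    listing_hash: str  # hash of the listing when the review was written


@dataclass
class Product:
    title: str
    description: str
    reviews: list = field(default_factory=list)

    def current_hash(self) -> str:
        return listing_hash(self.title, self.description)

    def add_review(self, text: str) -> None:
        # Record the current listing hash alongside the review.
        self.reviews.append(Review(text, self.current_hash()))

    def visible_reviews(self) -> list:
        # Hide every review written against a different listing hash.
        now = self.current_hash()
        return [r for r in self.reviews if r.listing_hash == now]


socks = Product("Baby socks, 6 pack", "Soft cotton socks")
socks.add_review("My grandchildren love these")
socks.title, socks.description = "1TB SD card", "Fast storage"
print(len(socks.visible_reviews()))  # the sock review is now hidden
```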
It can't be that easy!
Okay, it's not. It's not just a single hash, and it's not just any hash.
I would use multiple hashes, one for each section of the product, and I would use a hashing algorithm that's actually not sensitive to small changes.
What hashes to store?
Everything the merchant can change independently of other things gets a hash. That's the title, description, photos, metadata, etc.
Then store the list of hashes in a "version table" for the product, and store a version number with each review. Whenever something is changed about the product, calculate new hashes, and update the version table to indicate which versions are still "compatible" with the latest version, based on the overlap and similarity of the hashes of each version with the latest hashes.
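To make the version table a bit more concrete, here's a rough sketch. Everything in it is an assumption on my part: the VersionTable name, the 0.5 threshold, and especially the exact_match similarity, which is only a placeholder for a proper fuzzy-hash comparison.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def section_hash(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def exact_match(a: str, b: str) -> float:
    """Placeholder similarity: 1.0 if the hashes are identical, else 0.0."""
    return 1.0 if a == b else 0.0


@dataclass
class VersionTable:
    similarity: Callable[[str, str], float] = exact_match
    threshold: float = 0.5  # assumed cut-off, would need tuning
    versions: List[Dict[str, str]] = field(default_factory=list)

    def record(self, sections: Dict[str, str]) -> int:
        """Hash each section, store a new version, return its number."""
        self.versions.append({k: section_hash(v) for k, v in sections.items()})
        return len(self.versions) - 1

    def score(self, version: int) -> float:
        """Mean per-section similarity between `version` and the latest."""
        old, new = self.versions[version], self.versions[-1]
        keys = set(old) | set(new)
        total = sum(self.similarity(old.get(k, ""), new.get(k, "")) for k in keys)
        return total / len(keys) if keys else 1.0

    def compatible_versions(self) -> List[int]:
        """Versions whose reviews may still be shown with the latest listing."""
        return [v for v in range(len(self.versions))
                if self.score(v) >= self.threshold]


table = VersionTable()
table.record({"title": "Baby socks", "description": "Soft", "photos": "sock.jpg"})
table.record({"title": "Baby socks 6 pack", "description": "Soft", "photos": "sock.jpg"})
print(table.compatible_versions())  # a title tweak keeps version 0 compatible
table.record({"title": "1TB SD card", "description": "Fast", "photos": "sd.jpg"})
print(table.compatible_versions())  # a wholesale swap leaves only the new version
```

Each review would carry the version number returned by record(), and would only be displayed while that number is in compatible_versions().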
Amazon can even expose this to the merchant if they want to help them avoid getting items accidentally reset, though I would flag any item that flip-flops between compatible/incompatible and have that item or merchant investigated for trying to game the system.
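The flip-flop check could be as simple as counting transitions. This is only a sketch: the history list (one True/False per edit, saying whether the listing was still compatible with its reviewed version) and the max_flips limit are both things I made up for illustration.

```python
def count_flips(history):
    """Count how often consecutive entries switch between True and False."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


def should_investigate(history, max_flips=2):
    """Flag a listing that flips compatibility more than max_flips times."""
    return count_flips(history) > max_flips


print(should_investigate([True, True, True]))          # a stable listing
print(should_investigate([True, False, True, False]))  # a suspicious one
```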
How should they calculate these hashes?
In reality, Amazon should keep this part secret, otherwise merchants can use it to calculate "hash collisions" offline. They should only be able to find these online, where Amazon can track how often they do this and flag them for bad behavior.
With that said, what follows is an example algorithm that would catch flagrant bad behavior.
For text, we want to be insensitive to common edits, like typos, case changes, reordering certain words, and swapping out a single word at a time. It turns out, there's a whole field of research here (as there usually is), and what I would recommend is a Context-Triggered Piecewise Hash (CTPH), such as ssdeep.
What a CTPH does is generate a hash that is actually made up of multiple hashes of different pieces of the input. This means that changing a single word or two only alters a part of the hash, but changing most words will alter/replace the whole hash.
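To show the idea, here's a toy piecewise hash in Python. It is not ssdeep and its output won't match ssdeep's; the window size, block size, and one-character-per-piece encoding are simplifications I picked for illustration. A small checksum over the last few bytes decides where each piece ends, and each piece is hashed down to a single character.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 4  # bytes of local context that decide where a piece ends


def _piece_char(piece: bytes) -> str:
    """Reduce one piece of the input to a single character."""
    return ALPHABET[hashlib.sha256(piece).digest()[0] % 64]


def toy_ctph(text: str, block_size: int = 8) -> str:
    data = text.encode("utf-8")
    digest = []
    piece_start = 0
    for i in range(len(data)):
        # The "context": a checksum of the last WINDOW bytes only.
        window = data[max(0, i - WINDOW + 1): i + 1]
        if sum(window) % block_size == block_size - 1:
            # Trigger point: close the current piece and hash it.
            digest.append(_piece_char(data[piece_start:i + 1]))
            piece_start = i + 1
    if piece_start < len(data):
        # Hash whatever is left after the last trigger point.
        digest.append(_piece_char(data[piece_start:]))
    return "".join(digest)


print(toy_ctph("SanDisk 1TB Extreme microSDXC, 4K, 5K, A2"))
print(toy_ctph("SanDisk 1TB Extreme microSDXC, 4K, 5K, 8K, A2"))
```

Because the piece boundaries depend only on nearby bytes, every piece that ends before the edit hashes to the same character in both outputs, and the boundaries line up again shortly after the edit.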
As an example, here are two product titles with their respective ssdeep hashes, alongside a more standard SHA-1 hash:

SanDisk 1TB Extreme microSDXC UHS-I Memory Card with Adapter - Up to 190MB/s, C10, U3, V30, 4K, 5K, A2, Micro SD Card- SDSQXAV-1T00-GN6MA:
CTPH: RMWOuxainKtDuMRxqRVKWLm41sRjuJ5MoTkpZxv:RXO+ajRuv1s+M1xv
SHA1: 19c31609a533c002ae0de55b67d64b309917a6d9

SanDisk 1TB Extreme microSDXC UHS-I Memory Card with Adapter - Up to 190MB/s, C10, U3, V30, 4K, 5K, 8K, A2, Micro SD Card- SDSQXAV-1T00-GN6MA:
CTPH: RMWOuxainKtDuMRxqRVKWLm41sRjuJG5MoTkpZxv:RXO+ajRuv1sPM1xv
SHA1: 33cda6716d43e187141c3ba238e6d074435ba64e
(I tried making this a table, but it didn't come out looking right.)
In this example, I've added "8K" to the second title to reflect a normal change. The two CTPH hashes are nearly identical: the first part gains a single character (a G in the juJ5M sequence, making it juJG5M) and the second part has one changed character (the + before M1xv becomes a P), while the SHA1 hash is entirely different.
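To put a rough number on "nearly identical", here's a quick comparison of the hashes above using Python's difflib. This is a stand-in I chose because it's in the standard library; ssdeep ships its own comparison routine, which is what you'd actually use.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a stand-in for a fuzzy-hash compare."""
    return SequenceMatcher(None, a, b).ratio()


ctph_old = "RMWOuxainKtDuMRxqRVKWLm41sRjuJ5MoTkpZxv:RXO+ajRuv1s+M1xv"
ctph_new = "RMWOuxainKtDuMRxqRVKWLm41sRjuJG5MoTkpZxv:RXO+ajRuv1sPM1xv"
sha1_old = "19c31609a533c002ae0de55b67d64b309917a6d9"
sha1_new = "33cda6716d43e187141c3ba238e6d074435ba64e"

print(f"CTPH similarity: {similarity(ctph_old, ctph_new):.2f}")
print(f"SHA1 similarity: {similarity(sha1_old, sha1_new):.2f}")
```

The CTPH pair scores close to 1, while the SHA1 pair scores far lower, since any characters the two SHA1 digests share are shared by coincidence.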
Photos are a whole other realm of research, but the trendy thing today is to ask an AI to summarize a photo, then hash and compare the AI summaries. I mean, there are many ways to do this that don't require a GPU in production, but who am I to oppose trends? Honestly, I'm being a little lazy here; I'm sure there are better ways, but I was mostly interested in the text areas, and the photos can likely be left out entirely, as it's tough to sell a "Baby socks, but actually an SD card" even if the photos show an SD card. So GPT-4 it is!
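For the curious, one of those no-GPU options is an "average hash": shrink the image, then record which pixels are brighter than the mean. This is a different technique from the AI-summary approach, and the sketch below assumes the photo has already been decoded and downscaled into a small grid of grayscale values (the tiny 2x2 grids are made-up data).

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values, already downscaled (e.g. 8x8)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the image's mean, else 0.
    return [1 if value > mean else 0 for value in flat]


def hamming(a, b):
    """Number of differing bits; small means the photos look alike."""
    return sum(x != y for x, y in zip(a, b))


original = [[10, 200], [220, 30]]
brighter = [[15, 205], [225, 35]]   # same picture, slightly brighter
inverted = [[245, 55], [35, 225]]   # a very different picture

print(hamming(average_hash(original), average_hash(brighter)))
print(hamming(average_hash(original), average_hash(inverted)))
```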
Why isn't this already done?
As with the previous "free solution" (the Uber Eats Swapped Order problem), I don't think I'm smarter than anyone at Amazon. This is just a (possible) solution to a problem that I see, and instead of writing a post just about a problem, I'd like to write about a solution. If it happens to be an interesting solution, then great, and if it's a viable solution, even better.
Having worked at a large company before, I know it takes a lot more than a viable solution and interest to make something happen; it takes a lot of dedicated effort to get the change through all the stakeholders (and anyone who has the power to stop you even if they're not a stakeholder). And if you don't get it done in the end for some reason, you risk all that effort being seen as wasteful by the promotion committee, or whoever.
Anyway, I'm just a rando on the internet, so nobody needs to read this, listen to me, or do what I say. If you like, we can engage in improving the solution in the comments!