Problem with validation.

Message boards : Number crunching : Problem with validation.

Natalia Makarova
Volunteer moderator
Project administrator
Project developer
Project tester
Joined: 14 Jun 23
Posts: 438
Credit: 280,293
RAC: 0
Message 382 - Posted: 22 Aug 2023, 17:37:05 UTC - in response to Message 380.  
Last modified: 22 Aug 2023, 17:38:26 UTC

I wrote earlier: The server did not see them all and automatically reassigned them to recalculation.

Yes, you wrote it, and I quoted it here
https://boinc.termit.me/adsl/forum_thread.php?id=15&postid=376

But why
The server did not see them all…

???

Do you have an explanation?

Natalia Makarova
Message 383 - Posted: 22 Aug 2023, 17:44:34 UTC
Last modified: 22 Aug 2023, 17:55:26 UTC

Some statistics for you:
+----------------------------+--------+
| Count_Task_Assigned_Credit | Name   |
+----------------------------+--------+
|                      13605 | fzs600 |
+----------------------------+--------+

+-----------------------------+--------+
| Count_Task_Canceled_by_User | Name   |
+-----------------------------+--------+
|                        6692 | fzs600 |
+-----------------------------+--------+

The most important statistic is missing here: 1048 "invalid" WUs.

Demis,
can you provide these stats?

This is one of those WUs

Name spt_101_5657288334525935453_0

Outcome Validate error

Validate state Invalid

Credit 0.00

https://boinc.termit.me/adsl/result.php?resultid=792447

Natalia Makarova
Message 384 - Posted: 22 Aug 2023, 17:52:29 UTC - in response to Message 381.  
Last modified: 22 Aug 2023, 17:56:42 UTC

Meanwhile, I continue to look for the reason why this is happening.
I have my own version.
And I'll check it.
If it is confirmed, then everything is much simpler.

Demis,
give a counterexample to my version.
You need to find a WU that produced no tuples other than 12-tuples and 14-tuples, yet is not "invalid" and received credit.

I think there is no such counterexample.

Natalia Makarova
Message 385 - Posted: 22 Aug 2023, 17:59:58 UTC
Last modified: 22 Aug 2023, 18:00:19 UTC

It will be very interesting if all 1048 "invalid" WUs by fzs600 fit my version!

mmonnin
Joined: 15 Jun 23
Posts: 22
Credit: 10,821,491
RAC: 261
Message 386 - Posted: 22 Aug 2023, 18:21:06 UTC - in response to Message 385.  

It will be very interesting if all 1048 "invalid" WUs by fzs600 fit my version!


Go through the top PCs and check their invalids returned around Aug 19th: the ones that completed OK but have no output file. Then verify.

Natalia Makarova
Message 389 - Posted: 23 Aug 2023, 6:08:41 UTC
Last modified: 23 Aug 2023, 6:11:07 UTC

Name spt_101_5657288334525935453_0

This WU is being calculated.

Here is the console output:

Search for associative sets of primes              5:01:16
Current interval: [5657289386525206972 ... 5657289388525206972]
Checked  :     30%
Speed    :    205
Found 12:    871
Found 13:      0
Found 14:     42
Found 15:      0
Found 16:      1
Found 17:      0
Found 18:      0
Found 19:      0
Found 20:      0
Found 21:      0
Found 22:      0
Found 23:      0
Found 24:      0
Found 25:      0
Found 26:      0
Found 27:      0
Found 28:      0
Found 29:      0
Found 30:      0
Found 31:      0
Found 32:      0
Found 33:      0

16-tuple found!!

Demis,
we are ready to hear your version.
Why is this WU by fzs600 "invalid" and didn't get credit???

Let me remind you:

This is a task from fzs600
https://boinc.termit.me/adsl/result.php?resultid=792447

Name spt_101_5657288334525935453_0

Outcome Validate error

Validate state Invalid

Credit 0.00

Natalia Makarova
Message 390 - Posted: 23 Aug 2023, 6:20:43 UTC
Last modified: 23 Aug 2023, 6:21:38 UTC

I tried to view the 16-tuples at the link
https://boinc.termit.me/adsl/spt_list.php?k=16

Here is what it returned:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 68364032 bytes) in /var/www/adsl/html/inc/db_conn.inc on line 69

Maybe this is the reason?
The 16-tuple cannot be written to the database!

Demis
Project tester
Volunteer developer
Volunteer tester
Joined: 14 Jun 23
Posts: 277
Credit: 4,252,273
RAC: 6,931
Message 391 - Posted: 23 Aug 2023, 21:30:04 UTC - in response to Message 390.  

I tried to view the 16-tuples at the link
https://boinc.termit.me/adsl/spt_list.php?k=16

Here is what it returned:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 68364032 bytes) in /var/www/adsl/html/inc/db_conn.inc on line 69

Maybe this is the reason?
The 16-tuple cannot be written to the database!

No.

mmonnin
Message 392 - Posted: 23 Aug 2023, 21:35:04 UTC

Top 20 computers only have this type of invalid in that period from Aug 19-20th.

Demis
Message 393 - Posted: 23 Aug 2023, 22:08:45 UTC - in response to Message 392.  

Top 20 computers only have this type of invalid in that period from Aug 19-20th.

Yes.
I know.
Because the cause was found and corrected.
I am preparing an answer according to my version.

Demis
Message 394 - Posted: 23 Aug 2023, 22:44:51 UTC

My version (part A).
I will try to explain with one specific example from the link:
https://boinc.termit.me/adsl/result.php?resultid=792447

What key information is important for us to analyze?
Name spt_101_5657288334525935453_0
Workunit 512540 (https://boinc.termit.me/adsl/workunit.php?wuid=512540)
Report deadline 24 Aug 2023, 8:47:46 UTC
Received 20 Aug 2023, 0:10:46 UTC
Validate state Invalid

Stderr output:
...
...
Validator: Output file absent


To answer briefly the question "Why is this WU by fzs600 "invalid" and didn't get credit???",
my answer would be: "Blame spt_93_5188094934525935453_1" (from batch 93).

WTF, you say.

Well, at least that's what I told myself on Sunday. When I spent a whole day looking for a problem.
To be continued...

Demis
Message 396 - Posted: 23 Aug 2023, 23:30:30 UTC

My version (part B).
However, in reality it is not that simple.

Let's find this WU (spt_101_5657288334525935453_0) in the assimilator log file.
(I have trimmed the log lines.)
First:
2023-08-20 03:15:01 Started
(this is local time, UTC+3)
...
result spt_101_5657288334525935453_0
result spt_101_5657418984525935453_0
result spt_101_5657582784525935453_0
Invalid: Output file absent
...
result spt_101_5475179784525935453_1
result spt_101_5475378684525935453_1
result spt_93_5188094934525935453_1
terminate called after throwing an instance of 'EFileNotFound'
what(): File Not Found
Aborted (core dumped)

Second:
2023-08-20 03:20:01 Started
...
result spt_101_5657288334525935453_0
Invalid: Output file absent
...
result spt_101_5475179784525935453_1
result spt_101_5475378684525935453_1
result spt_93_5188094934525935453_1
terminate called after throwing an instance of 'EFileNotFound'
what(): File Not Found
Aborted (core dumped)

What is important to look at?

In the first run:
result spt_101_5657288334525935453_0
result spt_101_5657418984525935453_0
(These two steps have been completed successfully.)

In the second run:
result spt_101_5657288334525935453_0
Invalid: Output file absent

WTF!!
WTF!!

And also an abort immediately after result spt_93_5188094934525935453_1
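The comparison of the two runs can also be done mechanically. Below is a minimal Python sketch that uses only the trimmed log lines quoted above. It assumes that an error line belongs to the "result" line printed just before it, which is how the log is read in this post; the function name parse_run and the status labels are made up for the illustration.

```python
import re

# Trimmed excerpts of the two assimilator runs quoted above.
FIRST_RUN = """\
result spt_101_5657288334525935453_0
result spt_101_5657418984525935453_0
result spt_101_5657582784525935453_0
Invalid: Output file absent
result spt_93_5188094934525935453_1
terminate called after throwing an instance of 'EFileNotFound'
"""

SECOND_RUN = """\
result spt_101_5657288334525935453_0
Invalid: Output file absent
result spt_93_5188094934525935453_1
terminate called after throwing an instance of 'EFileNotFound'
"""


def parse_run(text):
    """Map each result name to 'ok', 'absent' or 'crash'.

    Assumption: an error line refers to the most recent 'result' line.
    """
    status = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"result (\S+)$", line)
        if match:
            current = match.group(1)
            status[current] = "ok"
        elif current and line.startswith("Invalid: Output file absent"):
            status[current] = "absent"
        elif current and line.startswith("terminate called"):
            status[current] = "crash"
    return status


first = parse_run(FIRST_RUN)
second = parse_run(SECOND_RUN)

# Results whose status differs between the two runs.
changed = {
    name: (first[name], second[name])
    for name in first
    if name in second and first[name] != second[name]
}
print(changed)
```

On these excerpts the only result that changes state between the runs is spt_101_5657288334525935453_0, which goes from processed to "Output file absent".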
To be continued...

Demis
Message 399 - Posted: 24 Aug 2023, 16:00:54 UTC

Has everyone bought popcorn? Then we continue:
My version (part C).

Studying the assimilator code, I found the exact place where the error occurs.
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L401C1-L405C79
But to be more specific, this is the line:
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L404C4-L404C4

The method readFile of CFileStream inbuf;

This method is defined in the class https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L58
on the line
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L63

And the error "File Not Found" (see Part B: 'EFileNotFound' what(): File Not Found) comes to us from this line:
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L69

Why?
See where EFileNotFound (thrown as "EFileNotFound();") is defined:
https://github.com/tomasbrod/tbboinc/blob/primes/bocom/Stream.cpp
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/bocom/Stream.cpp#L8
struct EFileNotFound	: std::exception { const char * what () const noexcept {return "File Not Found";} };

Ok.
So far everything is logical.

After "return "File Not Found";" the program comes back to this point:
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L405

And at this moment "Aborted (core dumped)" happens.
The next function in the code is never executed.
Which function is that?
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L408
result_validate(result, inbuf, rstate);
Therefore the WU does not pass validation.

Execution simply never reaches it because of the error.
This is the answer to the question "didn't get credit???"
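The effect described here, one missing input file stopping everything queued after it, can be shown with a small model. This is a Python sketch, not the assimilator's C++ code: FileNotFound stands in for EFileNotFound, and read_input, process_batch and the result names are hypothetical. The isolate_errors switch only illustrates the difference between an exception that escapes the loop and one that is handled per result; it is not a description of how the project fixed the problem.

```python
class FileNotFound(Exception):
    """Stand-in for the EFileNotFound exception seen in the log."""


def read_input(name, available):
    # Hypothetical stand-in for reading a work unit's input file.
    if name not in available:
        raise FileNotFound(name)


def process_batch(results, available, isolate_errors):
    """Return (validated, failed).

    Without isolation, the first missing input file propagates out of the
    loop, so nothing after it is validated.
    """
    validated, failed = [], []
    for name in results:
        if isolate_errors:
            try:
                read_input(name, available)
            except FileNotFound:
                failed.append(name)
                continue
        else:
            read_input(name, available)
        validated.append(name)  # stands in for result_validate(...)
    return validated, failed


results = ["a_1", "missing_1", "b_1"]
available = {"a_1", "b_1"}

# Case 1: the exception escapes, as in the log ("terminate called ...").
try:
    process_batch(results, available, isolate_errors=False)
    crashed = False
except FileNotFound:
    crashed = True

# Case 2: the failure is recorded and the remaining results are still handled.
validated, failed = process_batch(results, available, isolate_errors=True)
print(crashed, validated, failed)
```

In the first case "b_1" is never reached, which matches the observation that validation is simply not executed for the results behind the failing one.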

This was a long description, but now let's get back to the beginning.

What do we know for sure now?
There is a file error for spt_93_5188094934525935453.in

But of course we already knew this from Part B:
result spt_93_5188094934525935453_1
terminate called after throwing an instance of 'EFileNotFound'
  what():  File Not Found
Aborted (core dumped)

Perfect!
We know from the code that the database write is wrapped in a transaction.
https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L640

Interrupting the program causes the database state to be rolled back to the beginning of the transaction.
This means that the other files that arrived from users at that time were not processed.
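The rollback behaviour can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module instead of the project's MySQL database, and the table and column names are invented for the example. A Python exception plays the role of the abort: it escapes the transaction, so every update made inside it is undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (name TEXT PRIMARY KEY, processed INTEGER)")
conn.executemany("INSERT INTO result VALUES (?, 0)", [("r1",), ("r2",), ("r3",)])
conn.commit()


def assimilate(conn, names, missing):
    # "with conn" commits if the block finishes and rolls back if an
    # exception escapes from it.
    with conn:
        for name in names:
            if name in missing:
                raise FileNotFoundError(name)
            conn.execute(
                "UPDATE result SET processed = 1 WHERE name = ?", (name,)
            )


try:
    assimilate(conn, ["r1", "r2", "r3"], missing={"r3"})
except FileNotFoundError:
    pass

processed = conn.execute(
    "SELECT COUNT(*) FROM result WHERE processed = 1"
).fetchone()[0]
print(processed)
```

Although r1 and r2 were updated before the failure on r3, the count of processed rows is 0 afterwards: the whole batch is lost, not just the failing result.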

Fine. Let's search and check this file:
locate spt_93_5188094934525935453
/var/www/adsl/download/spt_93_5188094934525935453.in

Fine.

Let's check it another way:
ls /var/www/adsl/download/spt_93_5188094934525935453.in
ls: cannot access '/var/www/adsl/download/spt_93_5188094934525935453.in': No such file or directory
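The two commands disagree because locate answers from a prebuilt index (refreshed by updatedb), while ls looks at the file system as it is now. A file deleted after the last index update is still "found" by locate. The Python sketch below models that with a list acting as the stale index; the helper name stale_entries and the file names are made up.

```python
import os
import tempfile


def stale_entries(indexed_paths):
    """Return the indexed paths that no longer exist on disk."""
    return [path for path in indexed_paths if not os.path.exists(path)]


with tempfile.TemporaryDirectory() as directory:
    kept = os.path.join(directory, "kept.in")
    gone = os.path.join(directory, "deleted.in")
    for path in (kept, gone):
        open(path, "w").close()

    index = [kept, gone]  # snapshot of the directory, like the locate database
    os.remove(gone)       # the file is deleted after the snapshot was taken

    stale = [os.path.basename(path) for path in stale_entries(index)]

print(stale)
```

The index still lists deleted.in, but a direct check shows it is gone, just as with spt_93_5188094934525935453.in.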

What?
Where is that damn file?

After a 30-minute break, I remembered that on Tuesday (15.08.2023) I had been processing the "download" directory...

I triple-checked that all issued WUs had been computed by users and made an archive of these files.
Then I deleted those files.

Thus, I am entirely to blame for this.
And I apologize to the entire community for my gross mistake.
I am very sorry.

It would seem that everything is simple.
This moron Demis deleted something there and all users got problems...

However, to be continued...
There will be several more parts.
And you need more popcorn...

Natalia Makarova
Message 400 - Posted: 24 Aug 2023, 16:31:22 UTC - in response to Message 399.  

And you need more popcorn...

Popcorn won't help :)))

Demis
Message 401 - Posted: 24 Aug 2023, 17:53:59 UTC

Before proceeding I want to ask a question

What do you find strange in part B
First:
2023-08-20 03:15:01 Started

Second:
2023-08-20 03:20:01 Started

?????

Am I the only one who got the idea?

You don't have to answer right now.
Just think...

Natalia Makarova
Message 402 - Posted: 24 Aug 2023, 18:17:55 UTC - in response to Message 401.  

Just think...

I won't think.

I already know what you will prove now.

Demis
Message 403 - Posted: 24 Aug 2023, 18:43:47 UTC - in response to Message 402.  

I don't think you have the right guess.

Demis
Message 404 - Posted: 24 Aug 2023, 18:51:03 UTC

My version (part D).

When did I create archives and delete files?
-rw-rw-r-- 1 boincadm boincadm  7229805 Aug 15 16:01 generated-file-in-87-2023-08-15_15-59-12.tar.gz
-rw-rw-r-- 1 boincadm boincadm    27568 Aug 15 16:05 generated-file-in-88-2023-08-15_16-05-19.tar.gz
-rw-rw-r-- 1 boincadm boincadm    54479 Aug 15 16:07 generated-file-in-89-2023-08-15_16-06-35.tar.gz
-rw-rw-r-- 1 boincadm boincadm   107972 Aug 15 16:08 generated-file-in-91-2023-08-15_16-08-05.tar.gz
-rw-rw-r-- 1 boincadm boincadm   214064 Aug 15 16:09 generated-file-in-93-2023-08-15_16-08-59.tar.gz
-rw-rw-r-- 1 boincadm boincadm   426733 Aug 15 16:10 generated-file-in-95-2023-08-15_16-10-11.tar.gz
-rw-rw-r-- 1 boincadm boincadm   846864 Aug 15 16:11 generated-file-in-97-2023-08-15_16-11-14.tar.gz

Did I check for errors after archiving?
Yes, Wednesday, Thursday and Friday.

There were no errors.

When did the first errors appear?
2023-08-18 23:55:01 Started
...
result spt_99_5317922034525935453_1
result spt_99_5372471334525935453_1
result spt_97_5264965884525935453_1
terminate called after throwing an instance of 'EFileNotFound'
  what():  File Not Found
Aborted (core dumped)

That is, at midnight on Saturday.

When was the problem solved, with everything working as usual again?
Last error "File Not Found":
2023-08-20 22:52:04 Started
Wiodb, name=spt_97_5251832634525935453_1_r1795014292_0,path=/var/www/adsl/upload/2f7/spt_97_52518326345
terminate called after throwing an instance of 'EFileNotFound'
  what():  File Not Found

Last event "Output file absent":
2023-08-21 01:55:01 Started
...
result spt_93_5199939234525935453_1
 Invalid: Output file absent
...


2023-08-21 02:00:01 Started

No errors at all.
220 response files processed without errors.
All other processing is also without errors.

This part gives us a precise understanding of when it happened.
And also answers the questions of those who say that no one solves server problems on the weekend ...

Should I explain why this happened on Saturday if the files were deleted on Tuesday?

My guess is:
1. I misjudged the completion of the batches that I archived. I mean, I missed something.
2. Responses arrived from a client that took tasks a long time ago and computed them, but connected to the server only on Saturday night.

To be continued...

Natalia Makarova
Message 405 - Posted: 24 Aug 2023, 19:58:36 UTC - in response to Message 403.  
Last modified: 24 Aug 2023, 19:58:47 UTC

I don't think you have the right guess.

Let's see...
I posted my guess.

Demis
Message 407 - Posted: 25 Aug 2023, 7:48:31 UTC

My version (part E).

We got so carried away with our analysis, discussing my operator error,
that we have completely forgotten what exactly we see in part B.
Let's go back to it.

It was not by chance that I asked "What do you find strange in part B".
I see three simple "questions or oddities" in front of me in Part B:
1. Who deleted the answer file spt_101_5657288334525935453_0_r375870431_0?
2. When did this happen?
3. Why did this happen?
2023-08-20 03:15:01 Started
 (this is local time, UTC+3; in UTC it is 00:15:01)
...
result spt_101_5657288334525935453_0 <- at this time the file spt_101_5657288334525935453_0_r375870431_0 was still there.

2023-08-20 03:20:01 Started
result spt_101_5657288334525935453_0 <- and at this time the file spt_101_5657288334525935453_0_r375870431_0 is gone.

  Invalid: Output file absent <- that's why we see this error.

That is, between 03:15:01 and 03:20:01 the user's response file disappeared.
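The timestamps from the log and from the result page can be put on one clock. Here is a small Python sketch of the conversion; the UTC+3 offset and all three timestamps are taken from this thread, while the helper name to_utc is made up.

```python
from datetime import datetime, timedelta, timezone

# Server local time is UTC+3, as stated in the post.
LOCAL = timezone(timedelta(hours=3))


def to_utc(stamp):
    """Interpret a log timestamp as server local time and convert it to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=LOCAL)
    return local.astimezone(timezone.utc)


first_run = to_utc("2023-08-20 03:15:01")
second_run = to_utc("2023-08-20 03:20:01")

# "Received 20 Aug 2023, 0:10:46 UTC" from the result page quoted in part A.
received = datetime(2023, 8, 20, 0, 10, 46, tzinfo=timezone.utc)

print(first_run.isoformat())
print(second_run - first_run)
print(received < first_run)
```

The first run starts at 00:15:01 UTC, a little over four minutes after the result was received, and the window in which the file disappeared is the five minutes between the two runs.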
It is very interesting!

About the files that I deleted, we already know the information from part D.
But the files from batch 101 were not deleted by me!

How does processing user responses work in general?
If we simplify it very much, then there is something like this sequence:
1. file_upload_handler receives the response from the user.
2. file_upload_handler sets several flags in special fields of the table in the database (the response file is accepted, the file needs to be assimilated, etc.)
3. When the assimilator starts, it "sees" a flag that "the file needs to be assimilated".
4. The assimilator does its job.
5. It then sets several flags in the database (the response has been processed, the response file can be deleted, etc.).
6. When file_deleter is run, it sees the "response file can be deleted" flag, deletes the file and writes this event to the log file.

The answer (partial) to point 1 (who) and to point 2 (when) is found:
cat /var/www/adsl/log_boinc-server/file_deleter.log|grep "spt_101_5657288334525935453"
2023-08-20 03:15:12.3005 [RESULT#792447] unlinked /var/www/adsl/upload/61/spt_101_5657288334525935453_0_r375870431_0
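The log line above can be taken apart and checked against the two assimilator start times from part B. This Python sketch assumes only the line format visible in that single line; the regular expression is written for this example and is not an official description of the file_deleter log format.

```python
import re
from datetime import datetime

LINE = (
    "2023-08-20 03:15:12.3005 [RESULT#792447] unlinked "
    "/var/www/adsl/upload/61/spt_101_5657288334525935453_0_r375870431_0"
)

# Timestamp (fraction ignored), result id, and the unlinked path.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ "
    r"\[RESULT#(\d+)\] unlinked (\S+)$"
)

match = PATTERN.match(LINE)
deleted_at = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
result_id = int(match.group(2))
path = match.group(3)

# Start times of the two assimilator runs quoted in part B (local time).
first_run = datetime(2023, 8, 20, 3, 15, 1)
second_run = datetime(2023, 8, 20, 3, 20, 1)

between_runs = first_run < deleted_at < second_run
seconds_after_first_start = int((deleted_at - first_run).total_seconds())
print(result_id, between_runs, seconds_after_first_start)
```

The result id 792447 is the one from the result page linked at the start, and the deletion happened 11 seconds after the first run started, well before the second run.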


That is, the file was deleted by the standard file_deleter utility, despite the fact that the transaction was canceled and the database should not have changed.
Incredible!!!

The thing is that file_deleter works on its own.
And the data for deletion is taken from the database (the table has special fields that are responsible for this).

And this means only one thing: the "flag to delete" was set in the database despite the transaction being cancelled by the error in "First".
Nonsense!
How is that even possible?
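One generic way a flag can survive a later failure is if the write was already committed before the failure happened, for example when each row is committed separately or the connection runs in autocommit mode. The Python sketch below only illustrates that mechanism with sqlite3 and invented table and column names. It is an assumption for illustration, not a finding about the assimilator, BOINC or the MySQL setup of this project.

```python
import sqlite3


def run(commit_each_row):
    """Flag r1 for deletion, fail on r2, roll back, and report what stayed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result (name TEXT PRIMARY KEY, can_delete INTEGER)"
    )
    conn.executemany("INSERT INTO result VALUES (?, 0)", [("r1",), ("r2",)])
    conn.commit()
    try:
        for name in ("r1", "r2"):
            if name == "r2":
                # Simulated failure on the second result.
                raise FileNotFoundError(name)
            conn.execute(
                "UPDATE result SET can_delete = 1 WHERE name = ?", (name,)
            )
            if commit_each_row:
                conn.commit()  # the flag becomes permanent here
    except FileNotFoundError:
        conn.rollback()  # undoes only what was not yet committed
    flagged = [
        row[0]
        for row in conn.execute("SELECT name FROM result WHERE can_delete = 1")
    ]
    conn.close()
    return flagged


print(run(commit_each_row=False))
print(run(commit_each_row=True))
```

With one transaction around the whole batch nothing stays flagged, while with a commit per row the flag on r1 survives the rollback. Whether anything comparable happens on the real server is exactly the open question raised here.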

Such an event immediately raises an incredibly wide range of completely different questions.
Is this a bug in our assimilator code?
Is this a bug in the BOINC server code?
Is this a mistake in the settings of the production server?
Is this a transactional model error?
Is this a bug in MySQL?
Is this a bug in MySQL settings?
Is this a bug in the settings of the OS on which the BOINC server is running?
Are we sure this error does not occur during data processing when "all the necessary files are there"?
How can we trust the received data?
And so on...

There are more than a thousand questions...
It would probably take more than a year to write them all down.

Therefore, question 3 has no answer for now.
And I just don't know what to do with it...

That's all for now.

P.S.
And please remember: I don't have much free time to do this project.
Therefore, while I am answering your questions, nothing else, including more important things, gets done on the project.
It's like this: "either we sit on the forum and blah blah blah" "or we do something necessary for the project."
It is not possible at the same time.
That's why I don't answer your questions very often.

And in conclusion, I repeat: when I was asked this question on Monday, I answered it briefly: "Yes"
https://boinc.termit.me/adsl/forum_thread.php?id=15&postid=365 .

But, in any case, your comments are welcome!

https://www.jitbit.com/alexblog/203-what-if-drivers-were-hired-like-programmers/

©2024 Natalia Makarova & Alex Belyshev & Tomáš Brada