Message boards :
Number crunching :
Problem with validation.
Author | Message |
---|---|
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
"I wrote earlier: The server did not see them all and automatically reassigned them to recalculation." Yes, you wrote it, and I quoted it here: https://boinc.termit.me/adsl/forum_thread.php?id=15&postid=376 But why did the server not see them all? Do you have an explanation? |
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
"Some statistics for you:" The most important statistic is missing here: the 1048 "invalid" WUs. Demis, can you provide these stats? This is one of those WUs: Name spt_101_5657288334525935453_0 https://boinc.termit.me/adsl/result.php?resultid=792447 |
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
"While I continue to look for reasons: why this is happening." Demis, give a counterexample to my version. You need to find a WU that produced no tuples other than 12-tuples and 14-tuples, yet is not "invalid" and received credit. I think there is no such counterexample. |
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
It will be very interesting if all 1048 "invalid" WUs from fzs600 fit my version! |
Send message Joined: 15 Jun 23 Posts: 22 Credit: 10,821,491 RAC: 261 |
"It will be very interesting if all 1048 'invalid' WU by fzs600 refer to my version!" Go through the top PCs, check the invalids they returned around Aug 19th that completed OK but have no file, and verify. |
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
"Name spt_101_5657288334525935453_0" This WU is being calculated. Here is the console:
Search for associative sets of primes 5:01:16
Current interval: [5657289386525206972 ... 5657289388525206972]
Checked : 30%
Speed : 205
Found 12: 871
Found 13: 0
Found 14: 42
Found 15: 0
Found 16: 1
Found 17: 0
Found 18: 0
Found 19: 0
Found 20: 0
Found 21: 0
Found 22: 0
Found 23: 0
Found 24: 0
Found 25: 0
Found 26: 0
Found 27: 0
Found 28: 0
Found 29: 0
Found 30: 0
Found 31: 0
Found 32: 0
Found 33: 0
A 16-tuple was found!! Demis, we are listening to your version. Why is this WU from fzs600 "invalid", and why didn't it get credit??? Let me remind you: this is a task from fzs600. |
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
I tried to view the 16-tuples at the link https://boinc.termit.me/adsl/spt_list.php?k=16 Here is what it returned:
Maybe this is the reason? The 16-tuple cannot be written to the database! |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
"I tried to watch 16-tuples from the link" No. |
Send message Joined: 15 Jun 23 Posts: 22 Credit: 10,821,491 RAC: 261 |
Top 20 computers only have this type of invalid in that period from Aug 19-20th. |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
"Top 20 computers only have this type of invalid in that period from Aug 19-20th." Yes, I know. That is because the cause was found and corrected. I am preparing an answer according to my version. |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
My version (part A). I will try to explain with one specific example from the link: https://boinc.termit.me/adsl/result.php?resultid=792447 What key information do we need for the analysis? Name: spt_101_5657288334525935453_0. If I had to answer the question "Why is this WU by fzs600 "invalid" and didn't get credit???" briefly, my answer would be: "Blame spt_93_5188094934525935453_1" (from batch 93). "WTF?", you say. Well, at least that's what I said to myself on Sunday, when I spent a whole day looking for the problem. To be continued... |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
My version (part B). However, in reality things are not so simple. Let's find this WU (spt_101_5657288334525935453_0) in the assimilator log file. (I have cut down the number of lines.)
First: 2023-08-20 03:15:01 Started
Second: 2023-08-20 03:20:01 Started
What is important to look at?
In the first run: result spt_101_5657288334525935453_0 result spt_101_5657418984525935453_0 (These two steps completed successfully.)
In the second run: result spt_101_5657288334525935453_0 Invalid: Output file absent
WTF!! WTF!! And there is also an abort immediately after result spt_93_5188094934525935453_1
To be continued... |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
Has everyone bought popcorn? Then we continue: My version (part C).
Studying the assimilator code, I found the exact place where the error occurs: https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L401C1-L405C79
To be more specific, it is this line: https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L404C4-L404C4
It calls the method readFile of CFileStream inbuf. This method is defined in the class at https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L58 on the line https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L63
The "File Not Found" error (see Part B: 'EFileNotFound' what(): File Not Found) comes to us from this line: https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L69
Why? See where EFileNotFound is defined: https://github.com/tomasbrod/tbboinc/blob/primes/bocom/Stream.cpp https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/bocom/Stream.cpp#L8
struct EFileNotFound : std::exception { const char * what () const noexcept {return "File Not Found";} };
OK. So far everything is logical. Once the exception is thrown, control comes back to this point: https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L405 Nothing catches the exception, so at this moment the program is terminated: "Aborted (core dumped)". The next function in the code never runs. Which function is that? https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L408 result_validate(result, inbuf, rstate);
Therefore the WU does not pass validation: execution simply never reaches that call because of the error. This is the answer to the question "didn't get credit???"
That was a long description, but now let's get back to the beginning.
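The failure mode described in part C can be reproduced in a few lines of C++. This is a minimal sketch, not the code from asim.cpp: EFileNotFound has the same shape as the struct quoted above, while read_result_file, process_batch, BatchReport, and the "missing" set are hypothetical stand-ins. It assumes, as the "terminate called after throwing an instance of 'EFileNotFound'" message suggests, that no handler catches the exception, and it shows one possible way a per-result try/catch would let the rest of the batch be processed.

```cpp
#include <exception>
#include <set>
#include <string>
#include <vector>

// Same shape as the exception quoted above from bocom/Stream.cpp.
struct EFileNotFound : std::exception {
    const char* what() const noexcept override { return "File Not Found"; }
};

// Hypothetical stand-in for opening a file: throws if the name is in `missing`.
void read_result_file(const std::string& name, const std::set<std::string>& missing) {
    if (missing.count(name)) throw EFileNotFound();
}

// Hypothetical summary of one assimilator run.
struct BatchReport {
    std::vector<std::string> processed;
    std::vector<std::string> skipped;
};

// Processes every result. A missing file is recorded and skipped instead of
// propagating out of the loop. With no handler anywhere up the call stack,
// the first EFileNotFound would reach std::terminate and no later result
// would be touched.
BatchReport process_batch(const std::vector<std::string>& results,
                          const std::set<std::string>& missing) {
    BatchReport report;
    for (const auto& name : results) {
        try {
            read_result_file(name, missing);
            report.processed.push_back(name);  // validation would happen here
        } catch (const EFileNotFound&) {
            report.skipped.push_back(name);
        }
    }
    return report;
}
```

Whether skipping is the right policy for the real assimilator is a separate design question; the sketch only shows why one bad file can stop every result queued behind it.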
What do we know for sure now? There is an error with the file spt_93_5188094934525935453.in. Of course, we already knew that from Part B:
result spt_93_5188094934525935453_1
terminate called after throwing an instance of 'EFileNotFound'
what(): File Not Found
Aborted (core dumped)
Perfect! We know from the code that the database writes are wrapped in a transaction: https://github.com/tomasbrod/tbboinc/blob/c125cc355b2dc9a1e536b5e5ded028d4e7f4613a/symprtu/asim.cpp#L640 Interrupting the program causes the database state to be rolled back to the beginning of the transaction. This means that the other result files that had arrived from users at that time were not processed.
Fine. Let's search for this file:
locate spt_93_5188094934525935453
/var/www/adsl/download/spt_93_5188094934525935453.in
Fine. Let's check it another way:
ls /var/www/adsl/download/spt_93_5188094934525935453.in
ls: cannot access '/var/www/adsl/download/spt_93_5188094934525935453.in': No such file or directory
What? Where is that damn file?
After taking 30 minutes to rest, I remembered that on Tuesday (15.08.2023) I had been cleaning up the "download" directory... I triple-checked that all issued WUs had been computed by users and made an archive of these files. Then I deleted those files. Thus, I am entirely to blame for this, and I apologize to the entire community for my gross mistake. I am very sorry.
It would seem that everything is simple: this moron Demis deleted something there and all users got problems... However, to be continued... There will be several more parts. And you need more popcorn... |
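The rollback behaviour mentioned in this post (work done inside a transaction disappears if the program dies before committing) can be modelled in a few lines. This is only a sketch under stated assumptions: an in-memory map stands in for the database, and Table, Transaction, assimilate_batch, and the key names are hypothetical, not taken from asim.cpp. Leaving the function without calling commit() plays the role of the abort.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a table of per-result flags.
using Table = std::map<std::string, bool>;

// Writes are staged and reach the table only on commit(). If the object goes
// away without commit(), the staged writes are lost, which models a rollback.
class Transaction {
public:
    explicit Transaction(Table& table) : table_(table) {}

    // Stage a write; nothing is visible in the table yet.
    void set(const std::string& key, bool value) {
        staged_.emplace_back(key, value);
    }

    // Apply all staged writes at once.
    void commit() {
        for (const auto& write : staged_) {
            table_[write.first] = write.second;
        }
        staged_.clear();
    }

private:
    Table& table_;
    std::vector<std::pair<std::string, bool>> staged_;
};

// Marks each result as processed inside one transaction. If `failing` is
// encountered, the function returns early and nothing is committed.
bool assimilate_batch(Table& db, const std::vector<std::string>& results,
                      const std::string& failing) {
    Transaction tx(db);
    for (const auto& name : results) {
        if (name == failing) return false;  // tx is dropped uncommitted
        tx.set(name, true);
    }
    tx.commit();
    return true;
}
```

In this model, a batch that contains one failing result leaves the table exactly as it was, including the results that were handled before the failure.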
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
And you need more popcorn... Popcorn won't help :))) |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
Before proceeding, I want to ask a question: what looks strange to you in part B? First: 2023-08-20 03:15:01 Started Second: 2023-08-20 03:20:01 Started ????? Am I the only one who noticed it? You don't have to answer right now. Just think... |
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
Just think... I won't think. I already know what you will prove now. |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
I don't think you have the right guess. |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
My version (part D).
When did I create the archives and delete the files?
-rw-rw-r-- 1 boincadm boincadm 7229805 Aug 15 16:01 generated-file-in-87-2023-08-15_15-59-12.tar.gz
-rw-rw-r-- 1 boincadm boincadm 27568 Aug 15 16:05 generated-file-in-88-2023-08-15_16-05-19.tar.gz
-rw-rw-r-- 1 boincadm boincadm 54479 Aug 15 16:07 generated-file-in-89-2023-08-15_16-06-35.tar.gz
-rw-rw-r-- 1 boincadm boincadm 107972 Aug 15 16:08 generated-file-in-91-2023-08-15_16-08-05.tar.gz
-rw-rw-r-- 1 boincadm boincadm 214064 Aug 15 16:09 generated-file-in-93-2023-08-15_16-08-59.tar.gz
-rw-rw-r-- 1 boincadm boincadm 426733 Aug 15 16:10 generated-file-in-95-2023-08-15_16-10-11.tar.gz
-rw-rw-r-- 1 boincadm boincadm 846864 Aug 15 16:11 generated-file-in-97-2023-08-15_16-11-14.tar.gz
Did I check for errors after archiving? Yes, on Wednesday, Thursday and Friday. There were no errors.
When did the first errors appear?
2023-08-18 23:55:01 Started
...
result spt_99_5317922034525935453_1
result spt_99_5372471334525935453_1
result spt_97_5264965884525935453_1
terminate called after throwing an instance of 'EFileNotFound'
what(): File Not Found
Aborted (core dumped)
That is, at midnight on Saturday.
When was the problem solved and everything working as usual again?
Last "File Not Found" error:
2023-08-20 22:52:04 Started
Wiodb, name=spt_97_5251832634525935453_1_r1795014292_0,path=/var/www/adsl/upload/2f7/spt_97_52518326345
terminate called after throwing an instance of 'EFileNotFound'
what(): File Not Found
Last "Output file absent" event:
2023-08-21 01:55:01 Started
...
result spt_93_5199939234525935453_1
Invalid: Output file absent
...
2023-08-21 02:00:01 Started
No errors at all. 220 response files were processed without errors. All other processing also ran without errors.
This part gives us a precise understanding of when it happened. It also answers those who say that no one solves server problems on the weekend...
Should I also explain why this happened on Saturday if the files were deleted on Tuesday? My guess is:
1. I misjudged whether all the batches I archived were really complete. That is, I missed something.
2. Responses arrived from a client that had taken tasks a long time ago and computed them, but connected to the server only on Saturday night.
To be continued... |
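The timeline in part D was read out of the assimilator log by hand. A small helper along the following lines could list the runs that hit a given error. This is a hedged sketch: runs_with_error is a hypothetical name, and the only assumption about the log format, taken from the excerpts in this thread, is that each run begins with a line of the form "YYYY-MM-DD HH:MM:SS Started". The real log may contain other line shapes.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the start timestamps of the runs whose output contains `needle`.
// Each run is reported at most once, in the order it appears in the log.
std::vector<std::string> runs_with_error(const std::string& log_text,
                                         const std::string& needle) {
    std::vector<std::string> hits;
    std::string current_run;        // timestamp of the run being scanned
    bool already_recorded = false;  // avoid reporting the same run twice
    const std::string marker = " Started";

    std::istringstream in(log_text);
    std::string line;
    while (std::getline(in, line)) {
        const bool is_start =
            line.size() > marker.size() &&
            line.compare(line.size() - marker.size(), marker.size(), marker) == 0;
        if (is_start) {
            // Everything before " Started" is treated as the timestamp.
            current_run = line.substr(0, line.size() - marker.size());
            already_recorded = false;
        } else if (!current_run.empty() && !already_recorded &&
                   line.find(needle) != std::string::npos) {
            hits.push_back(current_run);
            already_recorded = true;
        }
    }
    return hits;
}
```

Called once with "File Not Found" and once with "Output file absent", the first and last elements of each returned list would give the boundaries of the incident.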
Send message Joined: 14 Jun 23 Posts: 438 Credit: 280,293 RAC: 0 |
I don't think you have the right guess. Let's see... I posted my guess. |
Send message Joined: 14 Jun 23 Posts: 277 Credit: 4,254,800 RAC: 7,040 |
My version (part E). We got so carried away with the analysis and with discussing my operator error that we have completely forgotten what exactly we see in part B. Let's go back to it. It was not by chance that I asked "What do you see strange from part B". I see three simple questions, or oddities, in Part B:
1. Who deleted the response file spt_101_5657288334525935453_0_r375870431_0?
2. When did this happen?
3. Why did this happen?
2023-08-20 03:15:01 Started (this is local time, UTC+3; in UTC it is 00:15:01)
...
result spt_101_5657288334525935453_0 <- at this time the file spt_101_5657288334525935453_0_r375870431_0 was still there.
2023-08-20 03:20:01 Started
result spt_101_5657288334525935453_0 <- and at this time the file spt_101_5657288334525935453_0_r375870431_0 is gone.
Invalid: Output file absent <- that's why we see this error.
That is, between 03:15:01 and 03:20:01 the user's response file disappeared. Very interesting! We already know about the files that I deleted from part D. But I did not delete the files from batch 101!
How does the processing of user responses work in general? Greatly simplified, the sequence is something like this:
1. file_upload_handler accepts the response from the user.
2. file_upload_handler sets several flags in special fields of a database table (the response file is accepted, the file needs to be assimilated, etc.).
3. When the assimilator starts, it "sees" the flag that "the file needs to be assimilated".
4. The assimilator does its job.
5. It sets several flags in the database (the response has been processed, the response file can be deleted, etc.).
6. When file_deleter runs, it sees the "response file can be deleted" flag, deletes the file and writes this event to the log file.
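The six steps above can be sketched as a tiny state model. This is a simplified illustration only: ResultRow and its field names are made up for the example and do not match the real BOINC schema, and the three functions merely borrow the names of the daemons they imitate. The point it shows is that the deleter looks only at the delete flag, never at how that flag came to be set.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified database row for one uploaded response.
struct ResultRow {
    std::string name;
    bool file_on_disk = false;
    bool file_received = false;        // set in step 2
    bool needs_assimilation = false;   // set in step 2
    bool assimilated = false;          // set in step 5
    bool file_can_be_deleted = false;  // set in step 5
};

// Steps 1-2: accept the upload and flag the row for the assimilator.
void upload_handler(ResultRow& row) {
    row.file_on_disk = true;
    row.file_received = true;
    row.needs_assimilation = true;
}

// Steps 3-5: act only on flagged rows, then hand the file to the deleter.
bool assimilator(ResultRow& row) {
    if (!row.needs_assimilation) return false;
    // ... reading and validating the file would happen here ...
    row.needs_assimilation = false;
    row.assimilated = true;
    row.file_can_be_deleted = true;
    return true;
}

// Step 6: delete the file if the flag says so, and log the event.
bool file_deleter(ResultRow& row, std::vector<std::string>& log) {
    if (!row.file_can_be_deleted || !row.file_on_disk) return false;
    row.file_on_disk = false;
    log.push_back("unlinked " + row.name);
    return true;
}
```

In this model a row whose delete flag is set without a completed assimilation still loses its file, because file_deleter has no way to tell the difference.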
The (partial) answer to question 1 (who) and question 2 (when) has been found:
cat /var/www/adsl/log_boinc-server/file_deleter.log|grep "spt_101_5657288334525935453"
2023-08-20 03:15:12.3005 [RESULT#792447] unlinked /var/www/adsl/upload/61/spt_101_5657288334525935453_0_r375870431_0
That is, the file was deleted by the standard file_deleter utility, despite the fact that the transaction was cancelled and the database should not have changed. Incredible!!!
The thing is that file_deleter works on its own, and it takes the data about what to delete from the database (the table has special fields that are responsible for this). This can mean only one thing: the "flag to delete" was set in the database despite the cancellation of the transaction caused by the error in "First". Nonsense! How is that even possible?
Such an event immediately raises an incredibly wide range of completely different questions. Is this a bug in our assimilator code? Is this a bug in the BOINC server code? Is this a mistake in the settings of the production server? Is this an error in the transactional model? Is this a bug in MySQL? Is this a mistake in the MySQL settings? Is this a mistake in the settings of the OS on which the BOINC server is running? Are we sure this error does not also occur when "all the necessary files are there"? How can we trust the received data? And so on... There are more than a thousand questions. It would probably take more than a year to write them all down. Therefore, question 3 has no answer for now, and I just don't know what to do with it... That's all for now.
P.S. And please remember: I don't have much free time for this project. While I am answering your questions, nothing else, including more important things, gets done on the project. It's like this: "either we sit on the forum and blah blah blah" or "we do something necessary for the project." Both at the same time is not possible. That's why I don't answer your questions very often.
And in conclusion, I repeat: when I was asked a question on Monday, I answered it briefly, "Yes": https://boinc.termit.me/adsl/forum_thread.php?id=15&postid=365 But, in any case, your comments are welcome! https://www.jitbit.com/alexblog/203-what-if-drivers-were-hired-like-programmers/ |
©2024 Natalia Makarova & Alex Belyshev & Tomáš Brada